[2025-11-12 22:10:02,029][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch. [2025-11-12 22:10:02,821][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found). [2025-11-12 22:10:02,829][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch. [2025-11-12 22:10:03,960][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found). [2025-11-12 22:12:14,996][__main__][INFO] - Starting iteration 0. [2025-11-12 22:12:15,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2025-11-12 22:12:15,000][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:12:17,956][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:12:20,512][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:12:28,833][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 20 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:12:29,734][__main__][INFO] - Number of regex retries in iteration 0: 3 [2025-11-12 22:12:29,734][__main__][INFO] - agents played in iteration 0 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:12:42,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of 
VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00 [2025-11-12 22:12:42,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00 [2025-11-12 22:12:42,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00 [2025-11-12 22:12:42,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 37.45%, Block Peak % of device VRAM: 18.68%, ΔTime: 00:00:00 [2025-11-12 22:12:42,353][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:12:42,354][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 22:12:42,939][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:12:43,677][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:12:44,191][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:12:44,692][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:12:45,203][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:12:45,697][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:12:46,192][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:12:46,689][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:12:47,183][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:12:47,684][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:12:48,181][mllm.training.trainer_common][INFO] - Processing 
mini-batch 10 of 64 [2025-11-12 22:12:48,683][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:12:49,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:12:49,675][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:12:50,179][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:12:50,686][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:12:51,182][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:12:51,678][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:12:52,177][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:12:52,671][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:12:53,171][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 22:12:53,700][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:12:54,205][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:12:54,702][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:12:55,198][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:12:55,709][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:12:56,207][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:12:56,704][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:12:57,205][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:12:57,704][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:12:58,210][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:12:58,707][mllm.training.trainer_common][INFO] - Processing 
mini-batch 31 of 64 [2025-11-12 22:12:59,204][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:12:59,717][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:13:00,217][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:13:00,716][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:13:01,218][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:13:01,718][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:13:02,221][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:13:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:13:03,211][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:13:03,708][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 22:13:04,206][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:13:04,702][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:13:05,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:13:05,693][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:13:06,192][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:13:06,689][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:13:07,194][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:13:07,697][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:13:08,190][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:13:08,689][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:13:09,189][mllm.training.trainer_common][INFO] - Processing 
mini-batch 52 of 64 [2025-11-12 22:13:09,687][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:13:10,190][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:13:10,685][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:13:11,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:13:11,691][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:13:12,185][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:13:12,682][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:13:13,178][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:13:13,672][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:13:14,177][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 22:13:14,673][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:13:15,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9707 tokens. 
[2025-11-12 22:13:15,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 15.75%, Current % of VRAM taken: 53.20%, Block Peak % of device VRAM: 61.05%, ΔTime: 00:00:32
[2025-11-12 22:13:16,534][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-12 22:13:16,536][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-12 22:13:16,537][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-12 22:13:17,592][__main__][INFO] - Iteration 1 took 1m 2s (23.54% Gen, 74.77% Train). Generation: 14s, Training: 46s. Estimated remaining time: 52h 6m 3s. Estimated total time: 52h 9m 39s. Time estimates for 10 more iterations: 10m 25s, 100 more iterations: 1h 44m 19s, 500 more iterations: 8h 41m 36s.
[2025-11-12 22:13:17,595][__main__][INFO] - Starting iteration 1.
[2025-11-12 22:13:18,047][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1.
[2025-11-12 22:13:18,047][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-12 22:13:21,019][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3
[2025-11-12 22:13:28,300][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given that we have low per-item values for hats (1), and Bob has a high per-item value for hats (10), it's not beneficial to propose any hats. Similarly, while we have higher per-item values for books (10) and balls (10), Bob has a very low per-item value for balls (1), making it more advantageous to propose for books instead. Since proposing 10 hats would fully allocate them (as they are 10 in quantity), and there's no chance of us needing to share them in a proportional manner, we propose to take all 10 hats and none of the books or balls in this round. 
did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3
[2025-11-12 22:13:32,835][__main__][INFO] - Number of regex retries in iteration 1: 2
[2025-11-12 22:13:32,835][__main__][INFO] - agents played in iteration 1 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-12 22:13:33,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 46.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:13:33,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 46.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:13:33,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 46.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:13:33,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 46.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:13:33,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-12 22:13:33,838][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-12 22:13:34,463][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:13:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:13:35,430][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:13:35,973][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:13:36,466][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:13:36,963][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:13:37,456][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:13:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:13:38,447][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:13:38,942][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:13:39,434][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:13:39,929][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:13:40,424][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:13:40,917][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:13:41,410][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:13:41,905][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:13:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:13:42,890][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:13:43,382][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:13:43,875][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:13:44,367][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:13:44,858][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:13:45,352][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:13:45,845][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:13:46,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:13:46,830][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:13:47,323][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:13:47,815][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:13:48,308][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:13:48,801][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:13:49,296][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:13:49,788][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:13:50,281][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:13:50,773][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:13:51,265][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:13:51,757][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:13:52,249][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:13:52,741][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:13:53,233][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:13:53,730][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:13:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:13:54,719][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:13:55,213][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:13:55,707][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:13:56,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:13:56,693][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:13:57,187][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:13:57,679][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:13:58,172][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:13:58,665][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:13:59,158][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:13:59,653][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:14:00,148][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:14:00,644][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:14:01,142][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:14:01,637][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:14:02,131][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:14:02,626][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:14:03,121][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:14:03,623][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:14:04,125][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:14:04,621][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:14:05,137][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:14:05,639][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:14:06,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9668 tokens. [2025-11-12 22:14:06,816][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.01%, Current % of VRAM taken: 58.25%, Block Peak % of device VRAM: 61.69%, ΔTime: 00:00:32 [2025-11-12 22:14:07,549][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:14:07,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:14:07,552][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:14:08,568][__main__][INFO] - Iteration 2 took 50s (29.26% Gen, 68.69% Train). Generation: 14s, Training: 34s. Estimated remaining time: 42h 2m 41s. Estimated total time: 42h 7m 8s. Time estimates for 10 more iterations: 8m 25s, 100 more iterations: 1h 24m 14s, 500 more iterations: 7h 1m 11s. [2025-11-12 22:14:08,570][__main__][INFO] - Starting iteration 2. [2025-11-12 22:14:09,039][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-12 22:14:09,041][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:14:09,742][mllm.models.large_language_model_local][WARNING] - Response Proposal: 5 hats, 0 books, 5 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:14:11,517][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 5 books, 5 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:14:16,850][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:14:22,457][__main__][INFO] - Number of regex retries in iteration 2: 3 [2025-11-12 22:14:22,457][__main__][INFO] - agents played in iteration 2 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:14:23,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:14:23,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:14:23,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:14:23,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:14:23,416][mllm.training.trainer_ad_align][INFO] - Sharing 
advantage alignment data. [2025-11-12 22:14:23,417][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 22:14:24,037][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:14:24,493][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:14:24,993][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:14:25,489][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:14:25,987][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:14:26,485][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:14:26,982][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:14:27,479][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:14:27,973][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:14:28,470][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:14:28,970][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:14:29,473][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:14:29,972][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:14:30,469][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:14:30,970][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:14:31,468][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:14:31,968][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:14:32,472][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:14:32,976][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:14:33,474][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-12 22:14:33,975][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 22:14:34,474][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:14:34,996][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:14:35,507][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:14:36,006][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:14:36,511][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:14:37,004][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:14:37,499][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:14:38,001][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:14:38,494][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:14:38,989][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:14:39,485][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:14:39,978][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:14:40,473][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:14:40,967][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:14:41,462][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:14:41,956][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:14:42,450][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:14:42,949][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:14:43,442][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:14:43,937][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-12 22:14:44,445][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 22:14:44,937][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:14:45,432][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:14:45,929][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:14:46,422][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:14:46,928][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:14:47,423][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:14:47,915][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:14:48,413][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:14:48,907][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:14:49,401][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:14:49,897][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:14:50,391][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:14:50,886][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:14:51,381][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:14:51,876][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:14:52,372][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:14:52,867][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:14:53,370][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:14:53,870][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:14:54,371][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-12 22:14:54,871][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 22:14:55,371][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:14:55,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9639 tokens. [2025-11-12 22:14:56,545][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.02%, Current % of VRAM taken: 58.27%, Block Peak % of device VRAM: 61.67%, ΔTime: 00:00:32 [2025-11-12 22:14:57,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:14:57,274][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:14:57,277][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:14:58,222][__main__][INFO] - Iteration 3 took 49s (27.28% Gen, 70.80% Train). Generation: 13s, Training: 34s. Estimated remaining time: 40h 53m 54s. Estimated total time: 40h 59m 10s. Time estimates for 10 more iterations: 8m 11s, 100 more iterations: 1h 21m 58s, 500 more iterations: 6h 49m 51s. [2025-11-12 22:14:58,225][__main__][INFO] - Starting iteration 3. [2025-11-12 22:14:58,664][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-12 22:14:58,665][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-12 22:15:11,483][__main__][INFO] - Number of regex retries in iteration 3: 0
[2025-11-12 22:15:11,484][__main__][INFO] - agents played in iteration 3 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-12 22:15:12,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:15:12,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:15:12,413][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:15:12,436][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:15:12,437][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-12 22:15:12,438][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-12 22:15:13,070][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:15:13,540][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:15:14,048][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:15:14,545][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:15:15,043][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:15:15,544][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:15:16,047][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:15:16,543][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:15:17,039][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:15:17,548][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:15:18,046][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:15:18,542][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:15:19,036][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:15:19,531][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:15:20,042][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:15:20,536][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:15:21,032][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:15:21,530][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:15:22,025][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:15:22,523][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:15:23,020][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:15:23,515][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:15:24,021][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:15:24,516][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:15:25,010][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:15:25,502][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:15:25,996][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:15:26,493][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:15:26,988][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:15:27,483][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:15:27,980][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:15:28,474][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:15:28,973][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:15:29,467][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:15:29,963][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:15:30,463][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:15:30,958][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:15:31,454][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:15:31,949][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:15:32,443][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:15:32,947][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:15:33,444][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:15:33,940][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:15:34,437][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:15:34,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:15:35,430][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:15:35,932][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:15:36,429][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:15:36,938][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:15:37,434][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:15:37,929][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:15:38,431][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:15:38,928][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:15:39,427][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:15:39,922][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:15:40,416][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:15:40,914][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:15:41,408][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:15:41,902][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:15:42,397][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:15:42,893][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:15:43,391][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:15:43,890][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:15:44,387][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:15:44,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9693 tokens. [2025-11-12 22:15:45,572][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.91%, Current % of VRAM taken: 58.16%, Block Peak % of device VRAM: 61.55%, ΔTime: 00:00:32 [2025-11-12 22:15:46,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:15:46,310][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:15:46,312][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:15:47,248][__main__][INFO] - Iteration 4 took 48s (26.39% Gen, 71.69% Train). Generation: 12s, Training: 34s. Estimated remaining time: 40h 23m 6s. Estimated total time: 40h 29m 11s. Time estimates for 10 more iterations: 8m 5s, 100 more iterations: 1h 20m 58s, 500 more iterations: 6h 44m 51s. [2025-11-12 22:15:47,250][__main__][INFO] - Starting iteration 4. [2025-11-12 22:15:47,696][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-12 22:15:47,697][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:16:00,612][__main__][INFO] - Number of regex retries in iteration 4: 0 [2025-11-12 22:16:00,613][__main__][INFO] - agents played in iteration 4 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:16:01,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.23%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:16:01,509][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.23%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:16:01,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.23%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:16:01,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.23%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:16:01,562][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:16:01,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:16:02,212][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:16:02,664][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:16:03,162][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:16:03,657][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:16:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:16:04,646][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:16:05,142][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:16:05,654][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:16:06,148][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:16:06,645][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:16:07,160][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:16:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:16:08,151][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:16:08,649][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:16:09,144][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:16:09,655][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:16:10,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:16:10,645][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:16:11,141][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:16:11,636][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:16:12,135][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:16:12,629][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:16:13,122][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:16:13,616][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:16:14,110][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:16:14,603][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:16:15,101][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:16:15,594][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:16:16,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:16:16,584][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:16:17,077][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:16:17,577][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:16:18,072][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:16:18,574][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:16:19,068][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:16:19,564][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:16:20,064][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:16:20,561][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:16:21,078][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:16:21,574][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:16:22,070][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:16:22,569][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:16:23,064][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:16:23,559][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:16:24,063][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:16:24,558][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:16:25,056][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:16:25,553][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:16:26,053][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:16:26,566][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:16:27,059][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:16:27,551][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:16:28,047][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:16:28,541][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:16:29,041][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:16:29,534][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:16:30,027][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:16:30,534][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:16:31,029][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:16:31,528][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:16:32,026][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:16:32,523][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:16:33,021][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:16:33,520][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:16:34,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9670 tokens. [2025-11-12 22:16:34,741][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.00%, Current % of VRAM taken: 58.25%, Block Peak % of device VRAM: 61.60%, ΔTime: 00:00:32 [2025-11-12 22:16:35,524][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:16:35,526][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:16:35,528][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:16:36,428][__main__][INFO] - Iteration 5 took 48s (26.50% Gen, 71.65% Train). Generation: 12s, Training: 34s. Estimated remaining time: 40h 29m 43s. Estimated total time: 40h 36m 38s. Time estimates for 10 more iterations: 8m 7s, 100 more iterations: 1h 21m 13s, 500 more iterations: 6h 46m 6s. [2025-11-12 22:16:36,430][__main__][INFO] - Starting iteration 5. [2025-11-12 22:16:36,876][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-12 22:16:36,877][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:16:50,343][__main__][INFO] - Number of regex retries in iteration 5: 0 [2025-11-12 22:16:50,343][__main__][INFO] - agents played in iteration 5 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:16:51,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:16:51,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:16:51,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:16:51,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:16:51,311][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:16:51,312][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:16:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:16:52,422][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:16:52,927][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:16:53,427][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:16:53,922][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:16:54,422][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:16:54,917][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:16:55,411][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:16:55,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:16:56,405][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:16:56,899][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:16:57,393][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:16:57,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:16:58,387][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:16:58,881][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:16:59,374][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:16:59,869][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:17:00,363][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:17:00,865][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:17:01,364][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:17:01,858][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:17:02,355][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:17:02,851][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:17:03,347][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:17:03,844][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:17:04,340][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:17:04,843][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:17:05,339][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:17:05,836][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:17:06,358][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:17:06,853][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:17:07,352][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:17:07,847][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:17:08,341][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:17:08,837][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:17:09,330][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:17:09,825][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:17:10,320][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:17:10,814][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:17:11,310][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:17:11,803][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:17:12,297][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:17:12,792][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:17:13,287][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:17:13,781][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:17:14,276][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:17:14,770][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:17:15,265][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:17:15,758][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:17:16,251][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:17:16,755][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:17:17,249][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:17:17,742][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:17:18,248][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:17:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:17:19,233][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:17:19,728][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:17:20,222][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:17:20,723][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:17:21,224][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:17:21,722][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:17:22,219][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:17:22,720][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:17:23,218][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:17:23,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9769 tokens. [2025-11-12 22:17:24,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.01%, Current % of VRAM taken: 58.25%, Block Peak % of device VRAM: 61.67%, ΔTime: 00:00:32 [2025-11-12 22:17:25,147][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:17:25,149][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:17:25,151][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:17:26,053][__main__][INFO] - Iteration 6 took 49s (27.38% Gen, 70.78% Train). Generation: 13s, Training: 34s. Estimated remaining time: 40h 51m 7s. Estimated total time: 40h 58m 52s. Time estimates for 10 more iterations: 8m 11s, 100 more iterations: 1h 21m 57s, 500 more iterations: 6h 49m 48s. [2025-11-12 22:17:26,055][__main__][INFO] - Starting iteration 6. [2025-11-12 22:17:26,574][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-12 22:17:26,574][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:17:33,099][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:17:38,885][__main__][INFO] - Number of regex retries in iteration 6: 1 [2025-11-12 22:17:38,886][__main__][INFO] - agents played in iteration 6 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:17:39,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:17:39,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:17:39,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:17:39,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:17:39,944][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:17:39,944][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:17:40,640][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:17:41,097][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:17:41,604][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:17:42,099][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:17:42,595][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:17:43,092][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:17:43,590][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:17:44,088][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:17:44,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:17:45,081][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:17:45,588][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:17:46,081][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:17:46,575][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:17:47,071][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:17:47,566][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:17:48,072][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:17:48,567][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:17:49,063][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:17:49,575][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:17:50,072][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:17:50,566][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:17:51,065][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:17:51,560][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:17:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:17:52,555][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:17:53,050][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:17:53,547][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:17:54,043][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:17:54,541][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:17:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:17:55,533][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:17:56,032][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:17:56,529][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:17:57,025][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:17:57,523][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:17:58,019][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:17:58,517][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:17:59,018][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:17:59,513][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:18:00,014][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:18:00,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:18:01,009][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:18:01,505][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:18:01,998][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:18:02,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:18:02,985][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:18:03,479][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:18:03,987][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:18:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:18:04,974][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:18:05,468][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:18:05,961][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:18:06,456][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:18:06,949][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:18:07,443][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:18:07,937][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:18:08,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:18:08,931][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:18:09,425][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:18:09,919][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:18:10,418][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:18:10,916][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:18:11,411][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:18:11,913][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:18:12,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9673 tokens. [2025-11-12 22:18:13,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.99%, Current % of VRAM taken: 58.24%, Block Peak % of device VRAM: 61.57%, ΔTime: 00:00:32 [2025-11-12 22:18:13,836][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:18:13,838][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:18:13,839][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:18:14,774][__main__][INFO] - Iteration 7 took 48s (25.54% Gen, 72.52% Train). Generation: 12s, Training: 34s. Estimated remaining time: 40h 1m 29s. Estimated total time: 40h 10m 2s. Time estimates for 10 more iterations: 8m 2s, 100 more iterations: 1h 20m 20s, 500 more iterations: 6h 41m 40s. [2025-11-12 22:18:14,776][__main__][INFO] - Starting iteration 7. [2025-11-12 22:18:15,233][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-12 22:18:15,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:18:27,975][__main__][INFO] - Number of regex retries in iteration 7: 0 [2025-11-12 22:18:27,975][__main__][INFO] - agents played in iteration 7 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:18:28,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:18:28,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:18:28,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:18:28,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:18:28,926][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:18:28,927][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:18:29,615][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:18:30,076][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:18:30,580][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:18:31,076][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:18:31,581][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:18:32,079][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:18:32,573][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:18:33,074][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:18:33,569][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:18:34,063][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:18:34,559][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:18:35,053][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:18:35,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:18:36,045][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:18:36,541][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:18:37,037][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:18:37,534][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:18:38,030][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:18:38,525][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:18:39,020][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:18:39,517][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:18:40,013][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:18:40,510][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:18:41,005][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:18:41,500][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:18:41,999][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:18:42,497][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:18:42,991][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:18:43,500][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:18:43,997][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:18:44,494][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:18:44,989][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:18:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:18:45,989][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:18:46,483][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:18:46,977][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:18:47,473][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:18:47,970][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:18:48,466][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:18:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:18:49,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:18:49,966][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:18:50,462][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:18:50,957][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:18:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:18:51,950][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:18:52,446][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:18:52,940][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:18:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:18:53,935][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:18:54,431][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:18:54,928][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:18:55,428][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:18:55,924][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:18:56,421][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:18:56,918][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:18:57,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:18:57,911][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:18:58,409][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:18:58,909][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:18:59,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:18:59,907][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:19:00,408][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:19:00,904][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:19:01,403][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9678 tokens. [2025-11-12 22:19:02,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.95%, Current % of VRAM taken: 58.20%, Block Peak % of device VRAM: 61.62%, ΔTime: 00:00:32 [2025-11-12 22:19:02,862][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:19:02,864][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:19:02,866][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:19:03,783][__main__][INFO] - Iteration 8 took 48s (26.24% Gen, 71.86% Train). Generation: 12s, Training: 34s. Estimated remaining time: 40h 18m 9s. Estimated total time: 40h 27m 31s. Time estimates for 10 more iterations: 8m 5s, 100 more iterations: 1h 20m 55s, 500 more iterations: 6h 44m 35s. [2025-11-12 22:19:03,785][__main__][INFO] - Starting iteration 8. [2025-11-12 22:19:04,242][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-12 22:19:04,243][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:19:16,992][__main__][INFO] - Number of regex retries in iteration 8: 0 [2025-11-12 22:19:16,993][__main__][INFO] - agents played in iteration 8 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:19:17,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:19:17,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:19:17,965][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:19:17,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:19:17,989][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:19:17,990][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:19:18,643][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:19:19,102][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:19:19,601][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:19:20,111][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:19:20,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:19:21,110][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:19:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:19:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:19:22,605][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:19:23,099][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:19:23,592][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:19:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:19:24,604][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:19:25,098][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:19:25,594][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:19:26,088][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:19:26,582][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:19:27,075][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:19:27,570][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:19:28,075][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:19:28,568][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:19:29,079][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:19:29,574][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:19:30,068][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:19:30,574][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:19:31,068][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:19:31,562][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:19:32,054][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:19:32,547][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:19:33,040][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:19:33,533][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:19:34,026][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:19:34,523][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:19:35,021][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:19:35,516][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:19:36,010][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:19:36,508][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:19:37,013][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:19:37,507][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:19:38,000][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:19:38,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:19:38,988][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:19:39,495][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:19:39,995][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:19:40,490][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:19:40,986][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:19:41,479][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:19:41,972][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:19:42,463][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:19:42,956][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:19:43,451][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:19:43,945][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:19:44,440][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:19:44,934][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:19:45,427][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:19:45,918][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:19:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:19:46,909][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:19:47,416][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:19:47,915][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:19:48,413][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:19:48,912][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:19:49,409][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:19:49,915][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:19:50,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9550 tokens. [2025-11-12 22:19:51,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.00%, Current % of VRAM taken: 58.24%, Block Peak % of device VRAM: 61.68%, ΔTime: 00:00:32 [2025-11-12 22:19:51,891][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:19:51,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:19:51,894][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:19:52,837][__main__][INFO] - Iteration 9 took 48s (26.24% Gen, 71.82% Train). Generation: 12s, Training: 34s. Estimated remaining time: 40h 19m 34s. Estimated total time: 40h 29m 45s. Time estimates for 10 more iterations: 8m 5s, 100 more iterations: 1h 20m 59s, 500 more iterations: 6h 44m 57s. [2025-11-12 22:19:52,839][__main__][INFO] - Starting iteration 9. [2025-11-12 22:19:53,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-12 22:19:53,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:20:06,619][__main__][INFO] - Number of regex retries in iteration 9: 0 [2025-11-12 22:20:06,620][__main__][INFO] - agents played in iteration 9 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:20:07,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:20:07,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:20:07,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:20:07,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:20:07,748][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:20:07,749][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:20:08,443][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:20:08,902][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:20:09,414][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:20:09,909][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:20:10,404][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:20:10,905][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:20:11,401][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:20:11,898][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:20:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:20:12,893][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:20:13,411][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:20:13,910][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:20:14,404][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:20:14,900][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:20:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:20:15,895][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:20:16,390][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:20:16,884][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:20:17,378][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:20:17,871][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:20:18,364][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:20:18,857][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:20:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:20:19,846][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:20:20,340][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:20:20,834][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:20:21,332][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:20:21,825][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:20:22,320][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:20:22,813][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:20:23,308][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:20:23,819][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:20:24,314][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:20:24,809][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:20:25,303][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:20:25,797][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:20:26,302][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:20:26,796][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:20:27,290][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:20:27,799][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:20:28,293][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:20:28,788][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:20:29,290][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:20:29,785][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:20:30,291][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:20:30,788][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:20:31,283][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:20:31,790][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:20:32,284][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:20:32,777][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:20:33,270][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:20:33,763][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:20:34,258][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:20:34,755][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:20:35,250][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:20:35,745][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:20:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:20:36,739][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:20:37,238][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:20:37,736][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:20:38,235][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:20:38,732][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:20:39,226][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:20:39,725][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:20:40,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9682 tokens. [2025-11-12 22:20:40,930][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.94%, Current % of VRAM taken: 58.19%, Block Peak % of device VRAM: 61.66%, ΔTime: 00:00:32 [2025-11-12 22:20:41,669][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:20:41,671][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:20:41,672][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:20:42,618][__main__][INFO] - Iteration 10 took 49s (26.94% Gen, 71.14% Train). Generation: 13s, Training: 35s. Estimated remaining time: 40h 52m 46s. Estimated total time: 41h 3m 47s. Time estimates for 10 more iterations: 8m 12s, 100 more iterations: 1h 22m 7s, 500 more iterations: 6h 50m 37s. [2025-11-12 22:20:42,621][__main__][INFO] - Starting iteration 10. [2025-11-12 22:20:43,101][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. 
[2025-11-12 22:20:43,102][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:20:43,772][mllm.models.large_language_model_local][WARNING] - Response Proposal: 5 hats, 5 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:20:47,830][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given Alice's per-item values, she values hats much less than I do, and books and balls less than I do. I should take all the hats to maximize my points from this round. Since my value for hats is 10 and Alice's is 1, I can secure the maximum points for this round by taking all the hats and nothing else. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:20:51,513][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given that Bob values hats significantly more (10) compared to your own valuation (1), and similarly, Bob values books and balls equally (1). This suggests that Bob would likely propose to take all the hats, books, and balls to maximize his points. Therefore, proposing to take all 10 hats is a strategic move to seize the highest value items before Bob can propose for them. This proposal ensures that you get the maximum possible points from the hats, which is your most valuable item, assuming Bob will not allocate hats to you if he proposes to take all of them. 
did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:20:57,473][__main__][INFO] - Number of regex retries in iteration 10: 3 [2025-11-12 22:20:57,474][__main__][INFO] - agents played in iteration 10 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:20:58,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.24%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:20:58,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.24%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:20:58,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.24%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:20:58,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.24%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:20:58,403][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:20:58,404][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:20:59,029][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:20:59,482][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:20:59,983][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:21:00,477][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:21:00,972][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:21:01,466][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:21:01,961][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:21:02,455][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:21:02,949][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:21:03,450][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:21:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:21:04,449][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:21:04,948][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:21:05,445][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:21:05,945][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:21:06,440][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:21:06,940][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:21:07,437][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:21:07,933][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:21:08,430][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:21:08,928][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:21:09,426][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:21:09,922][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:21:10,416][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:21:10,934][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:21:11,429][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:21:11,924][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:21:12,437][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:21:12,933][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:21:13,428][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:21:13,925][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:21:14,420][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:21:14,918][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:21:15,418][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:21:15,916][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:21:16,434][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:21:16,933][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:21:17,432][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:21:17,929][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:21:18,424][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:21:18,939][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:21:19,438][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:21:19,932][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:21:20,430][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:21:20,927][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:21:21,430][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:21:21,926][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:21:22,422][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:21:22,932][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:21:23,426][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:21:23,920][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:21:24,418][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:21:24,914][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:21:25,422][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:21:25,916][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:21:26,416][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:21:26,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:21:27,431][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:21:27,933][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:21:28,433][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:21:28,933][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:21:29,443][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:21:29,944][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:21:30,445][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:21:30,951][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9642 tokens. [2025-11-12 22:21:31,643][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.90%, Current % of VRAM taken: 58.14%, Block Peak % of device VRAM: 61.58%, ΔTime: 00:00:32 [2025-11-12 22:21:32,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:21:32,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:21:32,404][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:21:34,228][__main__][INFO] - Iteration 11 took 51s (28.11% Gen, 68.32% Train). Generation: 14s, Training: 34s. Estimated remaining time: 42h 24m 30s. Estimated total time: 42h 36m 22s. Time estimates for 10 more iterations: 8m 31s, 100 more iterations: 1h 25m 12s, 500 more iterations: 7h 6m 3s. [2025-11-12 22:21:34,231][__main__][INFO] - Starting iteration 11. [2025-11-12 22:21:34,681][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-12 22:21:34,681][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:21:36,370][mllm.models.large_language_model_local][WARNING] - Response Proposal: 5 hats, 5 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:21:41,157][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given the per-item values, I see that I value hats at 1, books at 10, and balls at 10. Since Bob values books and balls very highly, I propose keeping all the hats to avoid any risk of items being allocated to him proportionally, which could significantly reduce my points if he proposes higher quantities for books and balls. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:21:49,983][__main__][INFO] - Number of regex retries in iteration 11: 2 [2025-11-12 22:21:49,984][__main__][INFO] - agents played in iteration 11 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:21:50,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.20%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:21:50,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.20%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:21:50,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.20%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:21:50,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.20%, Block Peak % of device VRAM: 
19.52%, ΔTime: 00:00:00 [2025-11-12 22:21:50,846][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:21:50,848][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 22:21:51,434][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:21:51,892][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:21:52,390][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:21:52,885][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:21:53,383][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:21:53,878][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:21:54,374][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:21:54,867][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:21:55,362][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:21:55,939][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:21:56,445][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:21:56,945][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:21:57,462][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:21:57,958][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:21:58,457][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:21:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:21:59,454][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:21:59,952][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:22:00,449][mllm.training.trainer_common][INFO] - Processing 
mini-batch 18 of 64 [2025-11-12 22:22:00,948][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:22:01,443][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 22:22:01,941][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:22:02,439][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:22:02,934][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:22:03,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:22:03,928][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:22:04,424][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:22:04,933][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:22:05,429][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:22:05,925][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:22:06,421][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:22:06,918][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:22:07,425][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:22:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:22:08,415][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:22:08,914][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:22:09,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:22:09,901][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:22:10,408][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:22:10,903][mllm.training.trainer_common][INFO] - Processing 
mini-batch 39 of 64 [2025-11-12 22:22:11,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:22:11,916][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 22:22:12,411][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:22:12,913][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:22:13,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:22:13,906][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:22:14,401][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:22:14,894][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:22:15,389][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:22:15,882][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:22:16,376][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:22:16,871][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:22:17,367][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:22:17,866][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:22:18,361][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:22:18,858][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:22:19,357][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:22:19,852][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:22:20,351][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:22:20,851][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:22:21,352][mllm.training.trainer_common][INFO] - Processing 
mini-batch 60 of 64 [2025-11-12 22:22:21,856][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:22:22,355][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 22:22:22,855][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:22:23,354][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9655 tokens. [2025-11-12 22:22:24,050][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 12.97%, Current % of VRAM taken: 57.22%, Block Peak % of device VRAM: 61.65%, ΔTime: 00:00:32 [2025-11-12 22:22:24,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:22:24,814][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:22:24,816][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:22:25,719][__main__][INFO] - Iteration 12 took 51s (29.98% Gen, 68.25% Train). Generation: 15s, Training: 34s. Estimated remaining time: 42h 19m 12s. Estimated total time: 42h 31m 56s. Time estimates for 10 more iterations: 8m 30s, 100 more iterations: 1h 25m 3s, 500 more iterations: 7h 5m 19s. [2025-11-12 22:22:25,721][__main__][INFO] - Starting iteration 12. [2025-11-12 22:22:26,166][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-12 22:22:26,167][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:22:40,018][__main__][INFO] - Number of regex retries in iteration 12: 0 [2025-11-12 22:22:40,019][__main__][INFO] - agents played in iteration 12 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:22:40,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:22:40,843][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:22:40,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:22:40,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:22:40,892][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:22:40,893][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:22:41,490][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:22:41,945][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:22:42,446][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:22:42,950][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:22:43,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:22:43,940][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:22:44,455][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:22:44,951][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:22:45,449][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:22:45,946][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:22:46,444][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:22:46,957][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:22:47,455][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:22:47,952][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:22:48,447][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:22:48,942][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:22:49,446][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:22:49,941][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:22:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:22:50,933][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:22:51,427][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:22:51,921][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:22:52,414][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:22:52,909][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:22:53,407][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:22:53,900][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:22:54,395][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:22:54,900][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:22:55,397][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:22:55,892][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:22:56,393][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:22:56,888][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:22:57,385][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:22:57,881][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:22:58,376][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:22:58,873][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:22:59,373][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:22:59,872][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:23:00,368][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:23:00,866][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:23:01,366][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:23:01,860][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:23:02,357][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:23:02,852][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:23:03,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:23:03,849][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:23:04,345][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:23:04,839][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:23:05,340][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:23:05,835][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:23:06,330][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:23:06,825][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:23:07,319][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:23:07,844][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:23:08,341][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:23:08,839][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:23:09,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:23:09,841][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:23:10,349][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:23:10,847][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:23:11,345][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:23:11,846][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:23:12,346][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:23:12,845][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:23:13,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9670 tokens. [2025-11-12 22:23:14,074][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.94%, Current % of VRAM taken: 58.18%, Block Peak % of device VRAM: 61.70%, ΔTime: 00:00:32 [2025-11-12 22:23:14,808][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:23:14,809][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:23:14,811][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:23:15,681][__main__][INFO] - Iteration 13 took 49s (27.97% Gen, 70.27% Train). Generation: 13s, Training: 34s. Estimated remaining time: 41h 2m 13s. Estimated total time: 41h 15m 47s. Time estimates for 10 more iterations: 8m 15s, 100 more iterations: 1h 22m 31s, 500 more iterations: 6h 52m 37s. [2025-11-12 22:23:15,683][__main__][INFO] - Starting iteration 13. [2025-11-12 22:23:16,157][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-12 22:23:16,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:23:30,250][__main__][INFO] - Number of regex retries in iteration 13: 0 [2025-11-12 22:23:30,251][__main__][INFO] - agents played in iteration 13 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:23:31,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.24%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:23:31,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.24%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:23:31,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.24%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:23:31,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.24%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:23:31,224][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:23:31,225][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:23:31,862][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:23:32,316][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:23:32,817][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:23:33,312][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:23:33,812][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:23:34,316][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:23:34,812][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:23:35,307][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:23:35,812][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:23:36,309][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:23:36,807][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:23:37,307][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:23:37,826][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:23:38,336][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:23:38,834][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:23:39,345][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:23:39,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:23:40,339][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:23:40,854][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:23:41,351][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:23:41,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:23:42,346][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:23:42,841][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:23:43,357][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:23:43,853][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:23:44,347][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:23:44,841][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:23:45,341][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:23:45,851][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:23:46,348][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:23:46,843][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:23:47,357][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:23:47,857][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:23:48,351][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:23:48,845][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:23:49,339][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:23:49,856][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:23:50,349][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:23:50,842][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:23:51,337][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:23:51,831][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:23:52,330][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:23:52,823][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:23:53,317][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:23:53,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:23:54,313][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:23:54,810][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:23:55,307][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:23:55,801][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:23:56,297][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:23:56,794][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:23:57,289][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:23:57,786][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:23:58,280][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:23:58,776][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:23:59,271][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:23:59,766][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:24:00,272][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:24:00,771][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:24:01,271][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:24:01,770][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:24:02,268][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:24:02,768][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:24:03,269][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:24:03,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9675 tokens. [2025-11-12 22:24:04,479][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 12.98%, Current % of VRAM taken: 57.22%, Block Peak % of device VRAM: 61.66%, ΔTime: 00:00:32 [2025-11-12 22:24:05,205][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:24:05,207][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:24:05,208][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:24:06,115][__main__][INFO] - Iteration 14 took 49s (28.21% Gen, 69.97% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 23m 31s. Estimated total time: 41h 37m 56s. Time estimates for 10 more iterations: 8m 19s, 100 more iterations: 1h 23m 15s, 500 more iterations: 6h 56m 19s. [2025-11-12 22:24:06,118][__main__][INFO] - Starting iteration 14. [2025-11-12 22:24:06,596][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-12 22:24:06,597][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:24:08,342][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 5 books, 5 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:24:20,670][__main__][INFO] - Number of regex retries in iteration 14: 1 [2025-11-12 22:24:20,670][__main__][INFO] - agents played in iteration 14 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:24:21,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:24:21,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:24:21,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:24:21,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:24:21,657][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:24:21,657][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:24:22,266][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:24:22,729][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:24:23,231][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:24:23,731][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:24:24,230][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:24:24,726][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:24:25,225][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:24:25,723][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:24:26,219][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:24:26,718][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:24:27,214][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:24:27,710][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:24:28,205][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:24:28,700][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:24:29,199][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:24:29,693][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:24:30,190][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:24:30,685][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:24:31,181][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:24:31,681][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:24:32,176][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:24:32,674][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:24:33,180][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:24:33,676][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:24:34,171][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:24:34,667][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:24:35,162][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:24:35,666][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:24:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:24:36,657][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:24:37,172][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:24:37,667][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:24:38,163][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:24:38,659][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:24:39,152][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:24:39,664][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:24:40,158][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:24:40,653][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:24:41,167][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:24:41,661][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:24:42,155][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:24:42,653][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:24:43,148][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:24:43,658][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:24:44,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:24:44,649][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:24:45,161][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:24:45,659][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:24:46,154][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:24:46,655][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:24:47,153][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:24:47,658][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:24:48,153][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:24:48,647][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:24:49,144][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:24:49,639][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:24:50,135][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:24:50,632][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:24:51,131][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:24:51,631][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:24:52,131][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:24:52,630][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:24:53,137][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:24:53,634][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:24:54,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9664 tokens. [2025-11-12 22:24:54,852][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.96%, Current % of VRAM taken: 58.20%, Block Peak % of device VRAM: 61.62%, ΔTime: 00:00:32 [2025-11-12 22:24:55,598][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:24:55,600][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:24:55,602][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:24:56,477][__main__][INFO] - Iteration 15 took 49s (28.21% Gen, 70.03% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 18m 49s. Estimated total time: 41h 34m 4s. Time estimates for 10 more iterations: 8m 18s, 100 more iterations: 1h 23m 8s, 500 more iterations: 6h 55m 40s. [2025-11-12 22:24:56,480][__main__][INFO] - Starting iteration 15. [2025-11-12 22:24:56,938][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-12 22:24:56,938][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:25:03,715][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:25:07,983][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:25:10,862][__main__][INFO] - Number of regex retries in iteration 15: 2 [2025-11-12 22:25:10,863][__main__][INFO] - agents played in iteration 15 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:25:11,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:25:11,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:25:11,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:25:11,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:25:11,798][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:25:11,799][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:25:12,439][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:25:12,894][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:25:13,395][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:25:13,894][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:25:14,390][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:25:14,888][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:25:15,385][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:25:15,881][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:25:16,381][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:25:16,876][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:25:17,371][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:25:17,868][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:25:18,364][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:25:18,859][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:25:19,360][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:25:19,856][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:25:20,368][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:25:20,869][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:25:21,367][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:25:21,864][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:25:22,388][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:25:22,888][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:25:23,388][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:25:23,884][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:25:24,382][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:25:24,879][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:25:25,374][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:25:25,871][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:25:26,367][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:25:26,873][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:25:27,369][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:25:27,866][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:25:28,364][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:25:28,858][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:25:29,353][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:25:29,852][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:25:30,346][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:25:30,857][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:25:31,352][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:25:31,847][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:25:32,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:25:32,849][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:25:33,351][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:25:33,846][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:25:34,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:25:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:25:35,340][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:25:35,839][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:25:36,335][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:25:36,830][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:25:37,325][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:25:37,817][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:25:38,312][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:25:38,809][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:25:39,302][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:25:39,797][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:25:40,290][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:25:40,785][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:25:41,288][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:25:41,782][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:25:42,279][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:25:42,780][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:25:43,277][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:25:43,775][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:25:44,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9710 tokens. [2025-11-12 22:25:45,011][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.97%, Current % of VRAM taken: 58.22%, Block Peak % of device VRAM: 61.83%, ΔTime: 00:00:32 [2025-11-12 22:25:45,742][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:25:45,744][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:25:45,746][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:25:46,720][__main__][INFO] - Iteration 16 took 49s (27.97% Gen, 70.07% Train). Generation: 13s, Training: 34s. Estimated remaining time: 41h 13m 3s. Estimated total time: 41h 29m 9s. Time estimates for 10 more iterations: 8m 17s, 100 more iterations: 1h 22m 58s, 500 more iterations: 6h 54m 51s. [2025-11-12 22:25:46,722][__main__][INFO] - Starting iteration 16. [2025-11-12 22:25:47,260][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-12 22:25:47,261][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:26:01,032][__main__][INFO] - Number of regex retries in iteration 16: 0 [2025-11-12 22:26:01,033][__main__][INFO] - agents played in iteration 16 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:26:01,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:26:01,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:26:01,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:26:01,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:26:01,943][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:26:01,944][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:26:02,599][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:26:03,057][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:26:03,563][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:26:04,058][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:26:04,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:26:05,052][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:26:05,547][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:26:06,050][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:26:06,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:26:07,039][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:26:07,547][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:26:08,047][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:26:08,541][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:26:09,037][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:26:09,534][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:26:10,031][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:26:10,526][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:26:11,027][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:26:11,522][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:26:12,019][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:26:12,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:26:13,016][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:26:13,512][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:26:14,015][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:26:14,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:26:15,012][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:26:15,507][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:26:16,001][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:26:16,501][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:26:16,997][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:26:17,492][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:26:17,990][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:26:18,486][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:26:18,981][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:26:19,476][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:26:19,971][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:26:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:26:20,973][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:26:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:26:21,971][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:26:22,466][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:26:22,962][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:26:23,458][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:26:23,955][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:26:24,462][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:26:24,958][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:26:25,455][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:26:25,963][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:26:26,459][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:26:26,957][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:26:27,453][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:26:27,950][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:26:28,447][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:26:28,942][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:26:29,437][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:26:29,934][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:26:30,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:26:30,928][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:26:31,424][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:26:31,919][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:26:32,418][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:26:32,917][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:26:33,416][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:26:33,915][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:26:34,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9675 tokens. [2025-11-12 22:26:35,170][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.94%, Current % of VRAM taken: 58.19%, Block Peak % of device VRAM: 61.58%, ΔTime: 00:00:32 [2025-11-12 22:26:35,885][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:26:35,887][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:26:35,890][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:26:36,790][__main__][INFO] - Iteration 17 took 49s (27.80% Gen, 70.38% Train). Generation: 13s, Training: 34s. Estimated remaining time: 40h 59m 36s. Estimated total time: 41h 16m 31s. Time estimates for 10 more iterations: 8m 15s, 100 more iterations: 1h 22m 33s, 500 more iterations: 6h 52m 45s. [2025-11-12 22:26:36,792][__main__][INFO] - Starting iteration 17. [2025-11-12 22:26:37,289][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-12 22:26:37,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:26:51,471][__main__][INFO] - Number of regex retries in iteration 17: 0 [2025-11-12 22:26:51,472][__main__][INFO] - agents played in iteration 17 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:26:52,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.23%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:26:52,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.23%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:26:52,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.23%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:26:52,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.23%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:26:52,358][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:26:52,359][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:26:52,971][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:26:53,430][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:26:53,933][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:26:54,441][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:26:54,937][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:26:55,438][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:26:55,937][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:26:56,433][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:26:56,931][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:26:57,433][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:26:57,933][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:26:58,429][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:26:58,925][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:26:59,421][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:26:59,918][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:27:00,415][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:27:00,912][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:27:01,409][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:27:01,904][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:27:02,406][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:27:02,906][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:27:03,405][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:27:03,905][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:27:04,402][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:27:04,897][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:27:05,393][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:27:05,894][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:27:06,390][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:27:06,889][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:27:07,383][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:27:07,894][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:27:08,388][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:27:08,882][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:27:09,381][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:27:09,875][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:27:10,376][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:27:10,871][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:27:11,371][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:27:11,866][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:27:12,360][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:27:12,859][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:27:13,354][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:27:13,850][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:27:14,352][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:27:14,852][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:27:15,352][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:27:15,851][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:27:16,376][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:27:16,872][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:27:17,378][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:27:17,873][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:27:18,370][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:27:18,868][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:27:19,365][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:27:19,863][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:27:20,359][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:27:20,855][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:27:21,351][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:27:21,843][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:27:22,337][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:27:22,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:27:23,333][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:27:23,831][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:27:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:27:24,830][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9742 tokens. [2025-11-12 22:27:25,504][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.98%, Current % of VRAM taken: 58.23%, Block Peak % of device VRAM: 61.69%, ΔTime: 00:00:32 [2025-11-12 22:27:26,249][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:27:26,251][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:27:26,252][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:27:27,167][__main__][INFO] - Iteration 18 took 49s (28.43% Gen, 69.73% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 16m 9s. Estimated total time: 41h 33m 55s. Time estimates for 10 more iterations: 8m 18s, 100 more iterations: 1h 23m 7s, 500 more iterations: 6h 55m 39s. [2025-11-12 22:27:27,169][__main__][INFO] - Starting iteration 18. [2025-11-12 22:27:27,647][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-12 22:27:27,648][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:27:31,870][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:27:32,203][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:27:40,919][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:27:41,145][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 30 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:27:41,862][__main__][INFO] - Number of regex retries in iteration 18: 4 [2025-11-12 22:27:41,862][__main__][INFO] - agents played in iteration 18 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:27:42,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:27:42,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:27:42,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:27:42,759][mllm.training.trainer_ad_align][INFO] - For task: Get 
advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:27:42,760][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:27:42,761][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 22:27:43,425][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:27:43,886][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:27:44,387][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:27:44,904][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:27:45,401][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:27:45,900][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:27:46,398][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:27:46,893][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:27:47,403][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:27:47,899][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:27:48,396][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:27:48,910][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:27:49,405][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:27:49,903][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:27:50,400][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:27:50,894][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:27:51,406][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 
22:27:51,904][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:27:52,399][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:27:52,896][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:27:53,391][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 22:27:53,926][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:27:54,429][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:27:54,924][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:27:55,424][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:27:55,921][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:27:56,418][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:27:56,911][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:27:57,405][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:27:57,907][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:27:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:27:58,896][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:27:59,393][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:27:59,890][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:28:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:28:00,878][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:28:01,372][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:28:01,873][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 
22:28:02,369][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:28:02,865][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:28:03,369][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:28:03,865][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 22:28:04,359][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:28:04,855][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:28:05,353][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:28:05,851][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:28:06,348][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:28:06,843][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:28:07,339][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:28:07,835][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:28:08,338][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:28:08,836][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:28:09,334][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:28:09,835][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:28:10,333][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:28:10,830][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:28:11,325][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:28:11,820][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:28:12,320][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 
22:28:12,816][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:28:13,314][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:28:13,822][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:28:14,321][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 22:28:14,825][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:28:15,330][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9775 tokens. [2025-11-12 22:28:16,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.03%, Current % of VRAM taken: 58.27%, Block Peak % of device VRAM: 61.81%, ΔTime: 00:00:32 [2025-11-12 22:28:16,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:28:16,814][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:28:16,816][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:28:17,758][__main__][INFO] - Iteration 19 took 50s (28.37% Gen, 69.75% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 27m 0s. Estimated total time: 41h 45m 36s. Time estimates for 10 more iterations: 8m 21s, 100 more iterations: 1h 23m 31s, 500 more iterations: 6h 57m 36s. [2025-11-12 22:28:17,760][__main__][INFO] - Starting iteration 19. 
[2025-11-12 22:28:18,255][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2025-11-12 22:28:18,255][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:28:19,558][mllm.models.large_language_model_local][WARNING] - Response Proposal: 5 hats, 5 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:28:19,591][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:28:31,631][__main__][INFO] - Number of regex retries in iteration 19: 2 [2025-11-12 22:28:31,632][__main__][INFO] - agents played in iteration 19 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:28:32,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:28:32,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:28:32,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:28:32,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:28:32,634][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. 
[2025-11-12 22:28:32,635][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 22:28:33,298][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:28:33,757][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:28:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:28:34,759][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:28:35,259][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:28:35,759][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:28:36,257][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:28:36,751][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:28:37,250][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:28:37,749][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:28:38,247][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:28:38,743][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:28:39,239][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:28:39,737][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:28:40,233][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:28:40,730][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:28:41,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:28:41,725][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:28:42,221][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:28:42,716][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 
22:28:43,212][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 22:28:43,724][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:28:44,220][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:28:44,715][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:28:45,214][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:28:45,709][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:28:46,214][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:28:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:28:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:28:47,700][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:28:48,194][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:28:48,688][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:28:49,181][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:28:49,676][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:28:50,173][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:28:50,668][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:28:51,163][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:28:51,659][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:28:52,155][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:28:52,651][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:28:53,146][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 
22:28:53,639][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 22:28:54,135][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:28:54,630][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:28:55,124][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:28:55,617][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:28:56,113][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:28:56,617][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:28:57,111][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:28:57,607][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:28:58,113][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:28:58,610][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:28:59,105][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:28:59,603][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:29:00,098][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:29:00,593][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:29:01,090][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:29:01,584][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:29:02,090][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:29:02,585][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:29:03,080][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:29:03,591][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 
22:29:04,085][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 22:29:04,584][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:29:05,083][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9622 tokens. [2025-11-12 22:29:05,773][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.98%, Current % of VRAM taken: 58.22%, Block Peak % of device VRAM: 61.58%, ΔTime: 00:00:32 [2025-11-12 22:29:06,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:29:06,524][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:29:06,525][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:29:07,459][__main__][INFO] - Iteration 20 took 49s (27.19% Gen, 70.91% Train). Generation: 13s, Training: 34s. Estimated remaining time: 40h 40m 50s. Estimated total time: 41h 0m 16s. Time estimates for 10 more iterations: 8m 12s, 100 more iterations: 1h 22m 0s, 500 more iterations: 6h 50m 2s. [2025-11-12 22:29:07,462][__main__][INFO] - Starting iteration 20. [2025-11-12 22:29:07,951][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. 
[2025-11-12 22:29:07,952][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:29:10,472][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:29:22,838][__main__][INFO] - Number of regex retries in iteration 20: 1 [2025-11-12 22:29:22,839][__main__][INFO] - agents played in iteration 20 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:29:23,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:29:23,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:29:23,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:29:23,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:29:23,785][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:29:23,786][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:29:24,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:29:24,873][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:29:25,376][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:29:25,884][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:29:26,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:29:26,891][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:29:27,395][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:29:27,893][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:29:28,402][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:29:28,898][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:29:29,394][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:29:29,896][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:29:30,391][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:29:30,892][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:29:31,386][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:29:31,880][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:29:32,375][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:29:32,871][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:29:33,365][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:29:33,861][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:29:34,357][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:29:34,854][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:29:35,351][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:29:35,847][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:29:36,345][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:29:36,841][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:29:37,337][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:29:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:29:38,330][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:29:38,833][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:29:39,329][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:29:39,824][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:29:40,319][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:29:40,815][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:29:41,310][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:29:41,806][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:29:42,302][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:29:42,800][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:29:43,295][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:29:43,790][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:29:44,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:29:44,782][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:29:45,288][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:29:45,783][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:29:46,279][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:29:46,775][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:29:47,272][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:29:47,767][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:29:48,262][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:29:48,756][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:29:49,253][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:29:49,747][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:29:50,240][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:29:50,740][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:29:51,235][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:29:51,733][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:29:52,228][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:29:52,726][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:29:53,233][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:29:53,727][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:29:54,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:29:54,719][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:29:55,217][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:29:55,721][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:29:56,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9696 tokens. [2025-11-12 22:29:56,916][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.94%, Current % of VRAM taken: 58.19%, Block Peak % of device VRAM: 61.59%, ΔTime: 00:00:32 [2025-11-12 22:29:57,690][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:29:57,692][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:29:57,694][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:29:59,534][__main__][INFO] - Iteration 21 took 51s (28.86% Gen, 67.57% Train). Generation: 14s, Training: 34s. Estimated remaining time: 42h 38m 50s. Estimated total time: 42h 59m 8s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 58s, 500 more iterations: 7h 9m 51s. [2025-11-12 22:29:59,537][__main__][INFO] - Starting iteration 21. [2025-11-12 22:30:00,029][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2025-11-12 22:30:00,029][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:30:13,845][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given my per-item values (hats=1, books=10, balls=10) and knowing Bob's per-item values (hats=10, books=10, balls=1), I should prioritize taking items that I value more highly. Since I have very low value for hats (1), while Bob values hats significantly more (10), I should not propose to keep any hats. Books and balls are valued equally between us (10 for hats and 10 for books, 10 for books and 1 for balls), but since I value books and balls the same and they are worth more to me than hats, I should propose to keep all books and balls. However, given that I must keep the total quantity of items within 10, proposing 0 for both books and balls would ensure that I maximize the quantity of my high-value items. Thus, the optimal proposal is to take all 10 hats. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:30:16,372][__main__][INFO] - Number of regex retries in iteration 21: 1 [2025-11-12 22:30:16,373][__main__][INFO] - agents played in iteration 21 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:30:17,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:30:17,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:30:17,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 
00:00:00 [2025-11-12 22:30:17,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:30:17,292][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:30:17,293][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 22:30:17,948][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:30:18,400][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:30:18,912][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:30:19,414][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:30:19,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:30:20,419][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:30:20,916][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:30:21,415][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:30:21,913][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:30:22,409][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:30:22,904][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:30:23,407][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:30:23,902][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:30:24,416][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:30:24,915][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:30:25,411][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 
22:30:25,913][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:30:26,410][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:30:26,904][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:30:27,401][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:30:27,923][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 22:30:28,438][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:30:28,935][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:30:29,432][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:30:29,929][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:30:30,421][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:30:30,927][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:30:31,421][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:30:31,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:30:32,412][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:30:32,908][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:30:33,402][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:30:33,907][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:30:34,401][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:30:34,901][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:30:35,396][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:30:35,891][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 
22:30:36,394][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:30:36,892][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:30:37,390][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:30:37,887][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:30:38,380][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 22:30:38,876][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:30:39,368][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:30:39,861][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:30:40,357][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:30:40,850][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:30:41,343][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:30:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:30:42,334][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:30:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:30:43,348][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:30:43,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:30:44,343][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:30:44,838][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:30:45,338][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:30:45,836][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:30:46,334][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 
22:30:46,837][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:30:47,335][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:30:47,834][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:30:48,345][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:30:48,845][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 22:30:49,347][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:30:49,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9730 tokens. [2025-11-12 22:30:50,508][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.00%, Current % of VRAM taken: 58.25%, Block Peak % of device VRAM: 61.82%, ΔTime: 00:00:32 [2025-11-12 22:30:51,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:30:51,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:30:51,250][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:30:52,226][__main__][INFO] - Iteration 22 took 52s (31.31% Gen, 66.82% Train). Generation: 16s, Training: 34s. Estimated remaining time: 43h 8m 44s. Estimated total time: 43h 29m 55s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 59s, 500 more iterations: 7h 14m 59s. [2025-11-12 22:30:52,229][__main__][INFO] - Starting iteration 22. 
[2025-11-12 22:30:52,717][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-12 22:30:52,718][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:31:06,840][__main__][INFO] - Number of regex retries in iteration 22: 0 [2025-11-12 22:31:06,841][__main__][INFO] - agents played in iteration 22 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:31:07,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:31:07,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:31:07,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:31:07,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:31:07,774][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:31:07,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:31:08,429][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:31:08,885][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:31:09,390][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:31:09,890][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:31:10,386][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:31:10,881][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:31:11,378][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:31:11,872][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:31:12,368][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:31:12,866][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:31:13,365][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:31:13,862][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:31:14,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:31:14,856][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:31:15,352][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:31:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:31:16,345][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:31:16,840][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:31:17,358][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:31:17,858][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:31:18,355][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:31:18,855][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:31:19,355][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:31:19,852][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:31:20,351][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:31:20,844][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:31:21,340][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:31:21,843][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:31:22,339][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:31:22,833][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:31:23,333][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:31:23,830][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:31:24,327][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:31:24,830][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:31:25,329][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:31:25,843][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:31:26,342][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:31:26,838][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:31:27,355][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:31:27,853][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:31:28,354][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:31:28,853][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:31:29,354][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:31:29,851][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:31:30,350][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:31:30,846][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:31:31,356][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:31:31,854][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:31:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:31:32,849][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:31:33,344][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:31:33,840][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:31:34,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:31:34,834][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:31:35,349][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:31:35,845][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:31:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:31:36,841][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:31:37,338][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:31:37,855][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:31:38,356][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:31:38,859][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:31:39,363][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:31:39,864][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:31:40,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9727 tokens. [2025-11-12 22:31:41,044][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.00%, Current % of VRAM taken: 58.25%, Block Peak % of device VRAM: 61.82%, ΔTime: 00:00:32 [2025-11-12 22:31:41,775][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:31:41,776][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:31:41,778][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:31:42,762][__main__][INFO] - Iteration 23 took 50s (28.22% Gen, 69.81% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 20m 13s. Estimated total time: 41h 42m 14s. Time estimates for 10 more iterations: 8m 20s, 100 more iterations: 1h 23m 24s, 500 more iterations: 6h 57m 2s. [2025-11-12 22:31:42,764][__main__][INFO] - Starting iteration 23. [2025-11-12 22:31:43,239][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2025-11-12 22:31:43,239][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:31:44,268][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:31:49,069][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given Bob's per-item values, he highly values hats while valuing books more than I do. Since I value hats and books lowly, and balls neutrally, I should focus on claiming all the hats to maximize my points. By taking all 10 hats, I ensure that I get the full value of 10 points per hat, maximizing my potential points for this round. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:31:50,326][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 2/3 [2025-11-12 22:31:57,514][__main__][INFO] - Number of regex retries in iteration 23: 3 [2025-11-12 22:31:57,515][__main__][INFO] - agents played in iteration 23 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:31:58,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:31:58,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:31:58,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 
19.52%, ΔTime: 00:00:00 [2025-11-12 22:31:58,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:31:58,533][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:31:58,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 22:31:59,178][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:31:59,636][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:32:00,141][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:32:00,639][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:32:01,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:32:01,646][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:32:02,145][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:32:02,641][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:32:03,139][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:32:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:32:04,133][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:32:04,632][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:32:05,129][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:32:05,625][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:32:06,121][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:32:06,622][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 
22:32:07,135][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:32:07,632][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:32:08,127][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:32:08,626][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:32:09,121][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 22:32:09,621][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:32:10,117][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:32:10,612][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:32:11,110][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:32:11,605][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:32:12,100][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:32:12,594][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:32:13,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:32:13,586][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:32:14,081][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:32:14,576][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:32:15,077][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:32:15,573][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:32:16,071][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:32:16,567][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:32:17,060][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 
22:32:17,560][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:32:18,055][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:32:18,551][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:32:19,049][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:32:19,544][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 22:32:20,040][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:32:20,537][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:32:21,033][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:32:21,546][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:32:22,042][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:32:22,537][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:32:23,035][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:32:23,533][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:32:24,041][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:32:24,536][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:32:25,034][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:32:25,532][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:32:26,029][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:32:26,525][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:32:27,025][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:32:27,524][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 
22:32:28,044][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:32:28,543][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:32:29,044][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:32:29,545][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:32:30,045][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 22:32:30,553][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:32:31,050][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9853 tokens. [2025-11-12 22:32:31,770][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.96%, Current % of VRAM taken: 58.21%, Block Peak % of device VRAM: 61.85%, ΔTime: 00:00:32 [2025-11-12 22:32:32,542][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:32:32,543][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:32:32,545][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:32:33,539][__main__][INFO] - Iteration 24 took 50s (28.38% Gen, 69.64% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 32m 11s. Estimated total time: 41h 55m 3s. Time estimates for 10 more iterations: 8m 23s, 100 more iterations: 1h 23m 50s, 500 more iterations: 6h 59m 10s. [2025-11-12 22:32:33,541][__main__][INFO] - Starting iteration 24. 
[2025-11-12 22:32:34,121][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-12 22:32:34,122][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:32:48,487][__main__][INFO] - Number of regex retries in iteration 24: 0 [2025-11-12 22:32:48,488][__main__][INFO] - agents played in iteration 24 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:32:49,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:32:49,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:32:49,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:32:49,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:32:49,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:32:49,425][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:32:50,028][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:32:50,487][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:32:50,988][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:32:51,487][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:32:51,991][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:32:52,490][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:32:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:32:53,483][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:32:53,980][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:32:54,481][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:32:54,979][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:32:55,480][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:32:56,021][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:32:56,519][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:32:57,027][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:32:57,528][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:32:58,032][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:32:58,533][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:32:59,033][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:32:59,544][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:33:00,042][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:33:00,540][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:33:01,045][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:33:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:33:02,054][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:33:02,549][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:33:03,043][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:33:03,545][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:33:04,040][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:33:04,534][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:33:05,032][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:33:05,525][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:33:06,021][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:33:06,515][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:33:07,012][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:33:07,507][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:33:08,001][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:33:08,499][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:33:08,994][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:33:09,488][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:33:09,990][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:33:10,485][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:33:10,980][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:33:11,477][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:33:11,972][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:33:12,483][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:33:12,977][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:33:13,476][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:33:13,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:33:14,466][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:33:14,973][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:33:15,468][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:33:15,962][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:33:16,478][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:33:16,976][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:33:17,476][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:33:17,974][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:33:18,473][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:33:18,985][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:33:19,484][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:33:19,986][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:33:20,484][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:33:20,983][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:33:21,485][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:33:21,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9807 tokens. [2025-11-12 22:33:22,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 61.77%, ΔTime: 00:00:32 [2025-11-12 22:33:23,460][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:33:23,462][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:33:23,464][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:33:24,383][__main__][INFO] - Iteration 25 took 50s (28.58% Gen, 69.59% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 29m 24s. Estimated total time: 41h 53m 6s. Time estimates for 10 more iterations: 8m 22s, 100 more iterations: 1h 23m 46s, 500 more iterations: 6h 58m 51s. [2025-11-12 22:33:24,385][__main__][INFO] - Starting iteration 25. [2025-11-12 22:33:24,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2025-11-12 22:33:24,871][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:33:25,937][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:33:34,493][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:33:36,692][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:33:38,890][__main__][INFO] - Number of regex retries in iteration 25: 3 [2025-11-12 22:33:38,891][__main__][INFO] - agents played in iteration 25 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:33:39,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:33:39,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:33:39,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:33:39,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:33:39,823][mllm.training.trainer_ad_align][INFO] - Sharing 
advantage alignment data. [2025-11-12 22:33:39,824][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 22:33:40,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:33:40,886][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:33:41,395][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:33:41,893][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:33:42,393][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:33:42,893][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:33:43,393][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:33:43,894][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:33:44,392][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:33:44,886][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:33:45,383][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:33:45,878][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:33:46,374][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:33:46,870][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:33:47,364][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:33:47,871][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:33:48,364][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:33:48,860][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:33:49,375][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:33:49,873][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-12 22:33:50,371][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 22:33:50,869][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:33:51,365][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:33:51,875][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:33:52,372][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:33:52,868][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:33:53,372][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:33:53,869][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:33:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:33:54,862][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:33:55,359][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:33:55,871][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:33:56,368][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:33:56,861][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:33:57,359][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:33:57,855][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:33:58,351][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:33:58,845][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:33:59,340][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:33:59,835][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:34:00,330][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-12 22:34:00,826][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 22:34:01,321][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:34:01,816][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:34:02,313][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:34:02,811][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:34:03,307][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:34:03,803][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:34:04,298][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:34:04,795][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:34:05,299][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:34:05,798][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:34:06,297][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:34:06,794][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:34:07,291][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:34:07,797][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:34:08,296][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:34:08,798][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:34:09,296][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:34:09,797][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:34:10,299][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:34:10,800][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-12 22:34:11,298][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 22:34:11,799][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:34:12,300][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9804 tokens. [2025-11-12 22:34:12,980][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 61.85%, ΔTime: 00:00:32 [2025-11-12 22:34:13,765][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:34:13,767][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:34:13,769][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:34:14,704][__main__][INFO] - Iteration 26 took 49s (28.13% Gen, 69.99% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 7m 10s. Estimated total time: 41h 31m 43s. Time estimates for 10 more iterations: 8m 18s, 100 more iterations: 1h 23m 3s, 500 more iterations: 6h 55m 17s. [2025-11-12 22:34:14,706][__main__][INFO] - Starting iteration 26. [2025-11-12 22:34:15,175][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2025-11-12 22:34:15,176][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:34:16,272][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:34:16,406][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:34:28,953][__main__][INFO] - Number of regex retries in iteration 26: 2 [2025-11-12 22:34:28,953][__main__][INFO] - agents played in iteration 26 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:34:29,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:34:29,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:34:29,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:34:29,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:34:29,886][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:34:29,887][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:34:30,504][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:34:30,958][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:34:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:34:31,969][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:34:32,466][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:34:32,962][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:34:33,459][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:34:33,969][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:34:34,467][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:34:34,962][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:34:35,470][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:34:35,966][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:34:36,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:34:36,967][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:34:37,461][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:34:37,972][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:34:38,469][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:34:38,966][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:34:39,463][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:34:39,959][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:34:40,460][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:34:40,960][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:34:41,456][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:34:41,966][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:34:42,463][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:34:42,959][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:34:43,458][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:34:43,956][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:34:44,456][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:34:44,957][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:34:45,454][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:34:45,952][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:34:46,447][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:34:46,945][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:34:47,443][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:34:47,939][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:34:48,437][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:34:48,935][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:34:49,431][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:34:49,927][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:34:50,422][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:34:50,920][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:34:51,415][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:34:51,913][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:34:52,410][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:34:52,905][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:34:53,403][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:34:53,899][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:34:54,394][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:34:54,904][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:34:55,403][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:34:55,904][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:34:56,401][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:34:56,899][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:34:57,405][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:34:57,905][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:34:58,405][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:34:58,910][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:34:59,412][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:34:59,911][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:35:00,408][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:35:00,907][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:35:01,408][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:35:01,906][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:35:02,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9777 tokens. [2025-11-12 22:35:03,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.00%, Current % of VRAM taken: 58.24%, Block Peak % of device VRAM: 61.75%, ΔTime: 00:00:32 [2025-11-12 22:35:03,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:35:03,920][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:35:03,924][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:35:04,856][__main__][INFO] - Iteration 27 took 49s (27.73% Gen, 70.39% Train). Generation: 13s, Training: 34s. Estimated remaining time: 40h 58m 41s. Estimated total time: 41h 24m 4s. Time estimates for 10 more iterations: 8m 16s, 100 more iterations: 1h 22m 48s, 500 more iterations: 6h 54m 0s. [2025-11-12 22:35:04,858][__main__][INFO] - Starting iteration 27. [2025-11-12 22:35:05,339][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2025-11-12 22:35:05,340][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:35:06,470][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:35:19,728][__main__][INFO] - Number of regex retries in iteration 27: 1 [2025-11-12 22:35:19,729][__main__][INFO] - agents played in iteration 27 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:35:20,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:35:20,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:35:20,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:35:20,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:35:20,744][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:35:20,745][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:35:21,384][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:35:21,838][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:35:22,340][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:35:22,840][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:35:23,340][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:35:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:35:24,335][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:35:24,831][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:35:25,328][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:35:25,822][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:35:26,319][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:35:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:35:27,315][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:35:27,810][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:35:28,305][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:35:28,815][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:35:29,315][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:35:29,810][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:35:30,305][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:35:30,799][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:35:31,305][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:35:31,798][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:35:32,293][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:35:32,808][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:35:33,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:35:33,797][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:35:34,291][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:35:34,782][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:35:35,302][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:35:35,796][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:35:36,290][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:35:36,802][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:35:37,294][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:35:37,787][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:35:38,286][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:35:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:35:39,287][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:35:39,780][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:35:40,273][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:35:40,779][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:35:41,272][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:35:41,767][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:35:42,260][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:35:42,754][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:35:43,247][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:35:43,741][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:35:44,237][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:35:44,736][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:35:45,235][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:35:45,734][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:35:46,233][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:35:46,729][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:35:47,230][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:35:47,729][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:35:48,228][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:35:48,732][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:35:49,231][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:35:49,728][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:35:50,227][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:35:50,723][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:35:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:35:51,724][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:35:52,220][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:35:52,720][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:35:53,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9743 tokens. [2025-11-12 22:35:53,923][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.00%, Current % of VRAM taken: 58.25%, Block Peak % of device VRAM: 61.67%, ΔTime: 00:00:32 [2025-11-12 22:35:54,687][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:35:54,689][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:35:54,691][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:35:55,658][__main__][INFO] - Iteration 28 took 50s (28.59% Gen, 69.48% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 29m 45s. Estimated total time: 41h 55m 59s. Time estimates for 10 more iterations: 8m 23s, 100 more iterations: 1h 23m 51s, 500 more iterations: 6h 59m 19s. [2025-11-12 22:35:55,661][__main__][INFO] - Starting iteration 28. [2025-11-12 22:35:56,204][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. 
[2025-11-12 22:35:56,204][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:35:58,119][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:36:05,340][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given Alice's values and my own, I see that: - Both of us value hats the same at 10 each, so it's a clear advantage if I can secure all the hats. - Alice values books more than I do, so I should avoid proposing for any books. - Alice values balls more than I do, so I should also avoid proposing for any balls. By proposing to take all 10 hats, I aim to maximize my points since hats have the highest value for both of us and I stand to gain the most points from them. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:36:11,031][__main__][INFO] - Number of regex retries in iteration 28: 2 [2025-11-12 22:36:11,032][__main__][INFO] - agents played in iteration 28 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:36:11,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:36:12,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:36:12,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:36:12,071][mllm.training.trainer_ad_align][INFO] - For 
task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:36:12,071][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:36:12,072][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 22:36:12,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:36:13,120][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:36:13,621][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:36:14,119][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:36:14,620][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:36:15,117][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:36:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:36:16,115][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:36:16,613][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:36:17,118][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:36:17,614][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:36:18,113][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:36:18,626][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:36:19,120][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:36:19,615][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:36:20,112][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:36:20,607][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 
22:36:21,116][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:36:21,614][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:36:22,111][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:36:22,627][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 22:36:23,129][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:36:23,626][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:36:24,121][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:36:24,617][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:36:25,125][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:36:25,621][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:36:26,116][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:36:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:36:27,108][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:36:27,614][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:36:28,108][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:36:28,603][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:36:29,102][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:36:29,597][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:36:30,092][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:36:30,588][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:36:31,082][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 
22:36:31,581][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:36:32,076][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:36:32,572][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:36:33,072][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 22:36:33,566][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:36:34,062][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:36:34,559][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:36:35,057][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:36:35,555][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:36:36,056][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:36:36,555][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:36:37,057][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:36:37,555][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:36:38,053][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:36:38,557][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:36:39,055][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:36:39,556][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:36:40,053][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:36:40,552][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:36:41,049][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:36:41,548][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 
22:36:42,047][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:36:42,545][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:36:43,040][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:36:43,545][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 22:36:44,043][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:36:44,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9767 tokens. [2025-11-12 22:36:45,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.02%, Current % of VRAM taken: 58.26%, Block Peak % of device VRAM: 61.68%, ΔTime: 00:00:32 [2025-11-12 22:36:46,003][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:36:46,004][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:36:46,006][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:36:46,992][__main__][INFO] - Iteration 29 took 50s (29.19% Gen, 68.86% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 52m 20s. Estimated total time: 42h 19m 26s. Time estimates for 10 more iterations: 8m 27s, 100 more iterations: 1h 24m 38s, 500 more iterations: 7h 3m 14s. [2025-11-12 22:36:46,994][__main__][INFO] - Starting iteration 29. 
[2025-11-12 22:36:47,478][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1.
[2025-11-12 22:36:47,478][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-12 22:36:56,529][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3
[2025-11-12 22:37:01,893][__main__][INFO] - Number of regex retries in iteration 29: 1
[2025-11-12 22:37:01,894][__main__][INFO] - agents played in iteration 29 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-12 22:37:02,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:37:02,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:37:02,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:37:02,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:37:02,812][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-12 22:37:02,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-12 22:37:03,433][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:37:03,889][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:37:04,415][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:37:04,912][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:37:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:37:05,926][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:37:06,423][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:37:06,923][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:37:07,420][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:37:07,916][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:37:08,424][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:37:08,924][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:37:09,421][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:37:09,917][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:37:10,413][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:37:10,922][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:37:11,418][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:37:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:37:12,425][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:37:12,921][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:37:13,418][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:37:13,915][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:37:14,415][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:37:14,921][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:37:15,424][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:37:15,920][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:37:16,431][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:37:16,929][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:37:17,424][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:37:17,920][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:37:18,417][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:37:18,929][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:37:19,425][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:37:19,920][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:37:20,415][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:37:20,912][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:37:21,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:37:21,900][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:37:22,395][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:37:22,891][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:37:23,393][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:37:23,891][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:37:24,403][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:37:24,903][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:37:25,416][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:37:25,920][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:37:26,421][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:37:26,920][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:37:27,421][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:37:27,925][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:37:28,425][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:37:28,927][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:37:29,436][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:37:29,936][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:37:30,439][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:37:30,935][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:37:31,435][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:37:31,939][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:37:32,437][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:37:32,937][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:37:33,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:37:33,933][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:37:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:37:34,943][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:37:35,446][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9812 tokens. [2025-11-12 22:37:36,136][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 61.89%, ΔTime: 00:00:32 [2025-11-12 22:37:36,925][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:37:36,926][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:37:36,928][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:37:37,894][__main__][INFO] - Iteration 30 took 50s (28.59% Gen, 69.49% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 32m 53s. Estimated total time: 42h 0m 49s. Time estimates for 10 more iterations: 8m 24s, 100 more iterations: 1h 24m 1s, 500 more iterations: 7h 0m 8s. [2025-11-12 22:37:37,896][__main__][INFO] - Starting iteration 30. [2025-11-12 22:37:38,373][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2025-11-12 22:37:38,374][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:37:50,362][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls Given the information: - I value hats and balls at 1 each. - I value books at 10 each. - Bob values hats and balls at 10 each. 
- Bob values books at 1 each. Since I have a much higher value for books compared to Bob, proposing to get all the books would maximize my points for this round. Here, I am proposing to take all 10 books, while not taking any hats or balls. This would ensure that if my proposal matches Bob's, or if the total sum of proposals is equal to the item quantity, I will receive the full value of all books, which is 100 points (10 books * 10 points per book). did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:37:53,766][__main__][INFO] - Number of regex retries in iteration 30: 1 [2025-11-12 22:37:53,767][__main__][INFO] - agents played in iteration 30 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:37:54,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:37:54,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:37:54,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:37:54,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:37:54,777][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:37:54,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:37:55,371][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:37:55,932][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:37:56,440][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:37:56,938][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:37:57,435][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:37:57,932][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:37:58,441][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:37:58,936][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:37:59,433][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:37:59,931][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:38:00,427][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:38:00,937][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:38:01,433][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:38:01,932][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:38:02,445][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:38:02,942][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:38:03,439][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:38:03,940][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:38:04,463][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:38:04,968][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:38:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:38:05,977][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:38:06,478][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:38:06,977][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:38:07,477][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:38:07,974][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:38:08,473][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:38:08,971][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:38:09,468][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:38:09,964][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:38:10,464][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:38:10,961][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:38:11,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:38:11,958][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:38:12,458][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:38:12,960][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:38:13,459][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:38:13,959][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:38:14,462][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:38:14,962][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:38:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:38:15,962][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:38:16,461][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:38:16,960][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:38:17,457][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:38:17,957][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:38:18,460][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:38:18,963][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:38:19,466][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:38:19,967][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:38:20,467][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:38:20,967][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:38:21,466][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:38:21,966][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:38:22,464][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:38:22,964][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:38:23,464][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:38:23,964][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:38:24,461][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:38:24,963][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:38:25,462][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:38:25,964][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:38:26,462][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:38:26,961][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:38:27,458][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9894 tokens. [2025-11-12 22:38:28,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 61.97%, ΔTime: 00:00:32 [2025-11-12 22:38:28,951][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:38:28,952][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:38:28,955][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:38:30,930][__main__][INFO] - Iteration 31 took 52s (29.29% Gen, 66.95% Train). Generation: 15s, Training: 35s. Estimated remaining time: 43h 19m 2s. Estimated total time: 43h 47m 52s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 58s. [2025-11-12 22:38:30,932][__main__][INFO] - Starting iteration 31. [2025-11-12 22:38:31,403][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2025-11-12 22:38:31,404][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-12 22:38:36,661][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3
[2025-11-12 22:38:45,816][__main__][INFO] - Number of regex retries in iteration 31: 1
[2025-11-12 22:38:45,817][__main__][INFO] - agents played in iteration 31 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-12 22:38:46,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:38:46,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:38:46,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:38:46,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-12 22:38:46,667][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-12 22:38:46,667][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-12 22:38:47,284][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:38:47,741][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:38:48,255][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:38:48,754][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:38:49,250][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:38:49,761][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:38:50,261][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:38:50,759][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:38:51,255][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:38:51,753][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:38:52,260][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:38:52,758][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:38:53,258][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:38:53,755][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:38:54,253][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:38:54,751][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:38:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:38:55,746][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:38:56,258][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:38:56,758][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:38:57,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:38:57,753][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:38:58,252][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:38:58,754][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:38:59,250][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:38:59,747][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:39:00,246][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:39:00,745][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:39:01,243][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:39:01,741][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:39:02,240][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:39:02,746][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:39:03,244][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:39:03,742][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:39:04,242][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:39:04,746][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:39:05,245][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:39:05,745][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:39:06,247][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:39:06,760][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:39:07,259][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:39:07,761][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:39:08,256][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:39:08,755][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:39:09,260][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:39:09,758][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:39:10,261][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:39:10,759][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:39:11,258][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:39:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:39:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:39:12,765][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:39:13,271][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:39:13,770][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:39:14,270][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:39:14,770][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:39:15,268][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:39:15,771][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:39:16,268][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:39:16,764][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:39:17,265][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:39:17,765][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:39:18,262][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:39:18,761][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:39:19,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9961 tokens. [2025-11-12 22:39:19,978][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 61.91%, ΔTime: 00:00:32 [2025-11-12 22:39:20,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:39:20,759][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:39:20,761][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:39:21,763][__main__][INFO] - Iteration 32 took 50s (28.62% Gen, 69.39% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 28m 18s. Estimated total time: 41h 57m 58s. Time estimates for 10 more iterations: 8m 23s, 100 more iterations: 1h 23m 55s, 500 more iterations: 6h 59m 39s. [2025-11-12 22:39:21,765][__main__][INFO] - Starting iteration 32. [2025-11-12 22:39:22,250][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2025-11-12 22:39:22,251][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:39:35,461][__main__][INFO] - Number of regex retries in iteration 32: 0 [2025-11-12 22:39:35,461][__main__][INFO] - agents played in iteration 32 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:39:36,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:39:36,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:39:36,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:39:36,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:39:36,340][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:39:36,341][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:39:36,955][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:39:37,412][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:39:37,923][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:39:38,424][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:39:38,922][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:39:39,419][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:39:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:39:40,421][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:39:40,918][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:39:41,417][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:39:41,926][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:39:42,422][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:39:42,938][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:39:43,434][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:39:43,933][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:39:44,434][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:39:44,933][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:39:45,440][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:39:45,936][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:39:46,434][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:39:46,934][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:39:47,433][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:39:47,928][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:39:48,433][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:39:48,932][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:39:49,451][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:39:49,948][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:39:50,445][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:39:50,944][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:39:51,440][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:39:51,937][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:39:52,435][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:39:52,932][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:39:53,440][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:39:53,936][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:39:54,433][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:39:54,938][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:39:55,438][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:39:55,941][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:39:56,440][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:39:56,939][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:39:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:39:57,953][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:39:58,454][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:39:58,955][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:39:59,454][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:39:59,971][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:40:00,472][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:40:00,975][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:40:01,480][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:40:01,981][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:40:02,481][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:40:02,980][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:40:03,478][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:40:03,978][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:40:04,476][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:40:04,976][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:40:05,473][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:40:05,973][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:40:06,475][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:40:06,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:40:07,473][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:40:07,976][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:40:08,476][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:40:08,975][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9857 tokens. [2025-11-12 22:40:09,650][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.02%, Current % of VRAM taken: 58.26%, Block Peak % of device VRAM: 61.78%, ΔTime: 00:00:32 [2025-11-12 22:40:10,452][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:40:10,457][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:40:10,459][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:40:11,390][__main__][INFO] - Iteration 33 took 49s (26.88% Gen, 71.22% Train). Generation: 13s, Training: 34s. Estimated remaining time: 40h 26m 32s. Estimated total time: 40h 57m 2s. Time estimates for 10 more iterations: 8m 11s, 100 more iterations: 1h 21m 54s, 500 more iterations: 6h 49m 30s. [2025-11-12 22:40:11,393][__main__][INFO] - Starting iteration 33. [2025-11-12 22:40:11,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2025-11-12 22:40:11,870][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:40:25,707][__main__][INFO] - Number of regex retries in iteration 33: 0 [2025-11-12 22:40:25,708][__main__][INFO] - agents played in iteration 33 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:40:26,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:40:26,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:40:26,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:40:26,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:40:26,567][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:40:26,567][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:40:27,172][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:40:27,643][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:40:28,144][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:40:28,641][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:40:29,143][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:40:29,643][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:40:30,149][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:40:30,646][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:40:31,142][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:40:31,657][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:40:32,156][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:40:32,655][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:40:33,155][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:40:33,653][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:40:34,168][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:40:34,664][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:40:35,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:40:35,659][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:40:36,155][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:40:36,661][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:40:37,159][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:40:37,656][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:40:38,155][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:40:38,653][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:40:39,149][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:40:39,644][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:40:40,142][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:40:40,654][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:40:41,150][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:40:41,649][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:40:42,150][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:40:42,647][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:40:43,154][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:40:43,653][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:40:44,150][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:40:44,661][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:40:45,155][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:40:45,653][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:40:46,155][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:40:46,656][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:40:47,162][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:40:47,685][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:40:48,184][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:40:48,685][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:40:49,183][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:40:49,683][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:40:50,187][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:40:50,689][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:40:51,191][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:40:51,695][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:40:52,195][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:40:52,694][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:40:53,193][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:40:53,697][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:40:54,196][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:40:54,695][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:40:55,196][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:40:55,695][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:40:56,194][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:40:56,694][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:40:57,191][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:40:57,689][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:40:58,185][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:40:58,686][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:40:59,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9800 tokens. [2025-11-12 22:40:59,872][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 61.83%, ΔTime: 00:00:32 [2025-11-12 22:41:00,686][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:41:00,688][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:41:00,690][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:41:01,612][__main__][INFO] - Iteration 34 took 49s (27.82% Gen, 70.33% Train). Generation: 13s, Training: 34s. Estimated remaining time: 40h 55m 49s. Estimated total time: 41h 27m 9s. Time estimates for 10 more iterations: 8m 17s, 100 more iterations: 1h 22m 54s, 500 more iterations: 6h 54m 31s. [2025-11-12 22:41:01,614][__main__][INFO] - Starting iteration 34. [2025-11-12 22:41:02,104][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2025-11-12 22:41:02,104][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:41:15,840][__main__][INFO] - Number of regex retries in iteration 34: 0 [2025-11-12 22:41:15,840][__main__][INFO] - agents played in iteration 34 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:41:16,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:41:16,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:41:16,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:41:16,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:41:16,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:41:16,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:41:17,279][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:41:17,740][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:41:18,241][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:41:18,736][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:41:19,232][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:41:19,726][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:41:20,223][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:41:20,719][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:41:21,215][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:41:21,713][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:41:22,209][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:41:22,706][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:41:23,201][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:41:23,696][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:41:24,195][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:41:24,692][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:41:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:41:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:41:26,175][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:41:26,679][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:41:27,173][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:41:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:41:28,181][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:41:28,677][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:41:29,174][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:41:29,672][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:41:30,167][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:41:30,677][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:41:31,175][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:41:31,671][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:41:32,169][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:41:32,664][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:41:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:41:33,669][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:41:34,165][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:41:34,663][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:41:35,159][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:41:35,657][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:41:36,153][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:41:36,652][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:41:37,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:41:37,669][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:41:38,170][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:41:38,673][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:41:39,175][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:41:39,684][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:41:40,180][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:41:40,680][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:41:41,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:41:41,688][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:41:42,187][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:41:42,686][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:41:43,185][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:41:43,692][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:41:44,195][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:41:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:41:45,192][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:41:45,690][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:41:46,187][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:41:46,685][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:41:47,181][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:41:47,688][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:41:48,184][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:41:48,687][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:41:49,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9775 tokens. [2025-11-12 22:41:49,861][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.02%, Current % of VRAM taken: 58.26%, Block Peak % of device VRAM: 61.68%, ΔTime: 00:00:32 [2025-11-12 22:41:50,668][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:41:50,675][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:41:50,676][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:41:51,603][__main__][INFO] - Iteration 35 took 49s (27.75% Gen, 70.38% Train). Generation: 13s, Training: 34s. Estimated remaining time: 40h 42m 49s. Estimated total time: 41h 14m 59s. Time estimates for 10 more iterations: 8m 14s, 100 more iterations: 1h 22m 29s, 500 more iterations: 6h 52m 29s. [2025-11-12 22:41:51,605][__main__][INFO] - Starting iteration 35. [2025-11-12 22:41:52,106][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2025-11-12 22:41:52,106][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:42:06,427][__main__][INFO] - Number of regex retries in iteration 35: 0 [2025-11-12 22:42:06,427][__main__][INFO] - agents played in iteration 35 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:42:07,294][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:42:07,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:42:07,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:42:07,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:42:07,373][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:42:07,374][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:42:07,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:42:08,437][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:42:08,944][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:42:09,442][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:42:09,947][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:42:10,447][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:42:10,948][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:42:11,454][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:42:11,958][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:42:12,463][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:42:12,959][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:42:13,454][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:42:13,970][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:42:14,465][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:42:14,960][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:42:15,459][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:42:15,956][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:42:16,464][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:42:16,959][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:42:17,456][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:42:17,956][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:42:18,452][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:42:18,949][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:42:19,445][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:42:19,941][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:42:20,462][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:42:20,958][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:42:21,456][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:42:21,956][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:42:22,454][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:42:22,951][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:42:23,450][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:42:23,947][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:42:24,457][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:42:24,955][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:42:25,456][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:42:25,956][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:42:26,452][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:42:26,955][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:42:27,452][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:42:27,948][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:42:28,449][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:42:28,951][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:42:29,454][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:42:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:42:30,459][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:42:30,970][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:42:31,473][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:42:31,975][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:42:32,495][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:42:33,021][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:42:33,532][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:42:34,038][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:42:34,542][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:42:35,046][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:42:35,547][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:42:36,049][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:42:36,558][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:42:37,056][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:42:37,556][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:42:38,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:42:38,558][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:42:39,061][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:42:39,565][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:42:40,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9931 tokens. [2025-11-12 22:42:40,772][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.03%, ΔTime: 00:00:32 [2025-11-12 22:42:41,591][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:42:41,593][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:42:41,594][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:42:42,536][__main__][INFO] - Iteration 36 took 50s (28.40% Gen, 69.73% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 28m 32s. Estimated total time: 42h 1m 33s. Time estimates for 10 more iterations: 8m 24s, 100 more iterations: 1h 24m 3s, 500 more iterations: 7h 0m 15s. [2025-11-12 22:42:42,538][__main__][INFO] - Starting iteration 36. [2025-11-12 22:42:43,021][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2025-11-12 22:42:43,022][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:42:58,075][__main__][INFO] - Number of regex retries in iteration 36: 0 [2025-11-12 22:42:58,076][__main__][INFO] - agents played in iteration 36 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:42:58,931][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:42:58,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:42:58,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:42:59,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:42:59,003][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:42:59,004][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:42:59,603][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:43:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:43:00,558][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:43:01,057][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:43:01,567][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:43:02,063][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:43:02,562][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:43:03,066][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:43:03,562][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:43:04,065][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:43:04,560][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:43:05,056][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:43:05,556][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:43:06,052][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:43:06,550][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:43:07,046][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:43:07,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:43:08,044][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:43:08,542][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:43:09,049][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:43:09,582][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:43:10,077][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:43:10,595][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:43:11,094][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:43:11,594][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:43:12,087][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:43:12,582][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:43:13,085][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:43:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:43:14,075][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:43:14,586][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:43:15,081][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:43:15,578][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:43:16,077][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:43:16,574][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:43:17,086][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:43:17,581][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:43:18,078][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:43:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:43:19,072][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:43:19,584][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:43:20,082][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:43:20,580][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:43:21,094][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:43:21,594][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:43:22,090][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:43:22,591][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:43:23,091][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:43:23,596][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:43:24,095][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:43:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:43:25,094][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:43:25,594][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:43:26,097][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:43:26,596][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:43:27,094][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:43:27,592][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:43:28,089][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:43:28,587][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:43:29,085][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:43:29,581][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:43:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:43:30,579][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:43:31,078][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:43:31,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9939 tokens. [2025-11-12 22:43:32,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.02%, ΔTime: 00:00:32 [2025-11-12 22:43:33,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:43:33,021][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:43:33,023][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:43:33,963][__main__][INFO] - Iteration 37 took 50s (29.55% Gen, 68.60% Train). Generation: 15s, Training: 34s. Estimated remaining time: 41h 53m 16s. Estimated total time: 42h 27m 8s. Time estimates for 10 more iterations: 8m 29s, 100 more iterations: 1h 24m 54s, 500 more iterations: 7h 4m 31s. [2025-11-12 22:43:33,966][__main__][INFO] - Starting iteration 37. [2025-11-12 22:43:34,435][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2025-11-12 22:43:34,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:43:42,949][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:43:49,190][__main__][INFO] - Number of regex retries in iteration 37: 1 [2025-11-12 22:43:49,190][__main__][INFO] - agents played in iteration 37 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:43:50,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:43:50,048][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:43:50,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:43:50,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:43:50,095][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:43:50,096][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:43:50,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:43:51,149][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:43:51,653][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:43:52,150][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:43:52,648][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:43:53,144][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:43:53,643][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:43:54,139][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:43:54,637][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:43:55,134][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:43:55,634][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:43:56,132][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:43:56,628][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:43:57,124][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:43:57,622][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:43:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:43:58,617][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:43:59,114][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:43:59,611][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:44:00,111][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:44:00,610][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:44:01,108][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:44:01,608][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:44:02,105][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:44:02,604][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:44:03,105][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:44:03,602][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:44:04,106][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:44:04,599][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:44:05,094][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:44:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:44:06,088][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:44:06,590][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:44:07,087][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:44:07,605][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:44:08,106][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:44:08,603][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:44:09,099][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:44:09,596][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:44:10,098][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:44:10,609][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:44:11,106][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:44:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:44:12,106][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:44:12,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:44:13,109][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:44:13,602][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:44:14,100][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:44:14,609][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:44:15,105][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:44:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:44:16,102][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:44:16,598][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:44:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:44:17,599][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:44:18,097][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:44:18,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:44:19,099][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:44:19,598][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:44:20,095][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:44:20,592][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:44:21,102][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:44:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:44:22,100][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:44:22,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9879 tokens. [2025-11-12 22:44:23,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 61.91%, ΔTime: 00:00:32 [2025-11-12 22:44:24,046][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:44:24,048][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:44:24,050][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:44:25,006][__main__][INFO] - Iteration 38 took 50s (29.18% Gen, 68.93% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 33m 52s. Estimated total time: 42h 8m 35s. Time estimates for 10 more iterations: 8m 25s, 100 more iterations: 1h 24m 17s, 500 more iterations: 7h 1m 25s. [2025-11-12 22:44:25,008][__main__][INFO] - Starting iteration 38. [2025-11-12 22:44:25,474][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2025-11-12 22:44:25,474][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:44:39,548][__main__][INFO] - Number of regex retries in iteration 38: 0 [2025-11-12 22:44:39,548][__main__][INFO] - agents played in iteration 38 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:44:40,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:44:40,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:44:40,452][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:44:40,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:44:40,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:44:40,475][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:44:41,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:44:41,543][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:44:42,046][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:44:42,544][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:44:43,046][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:44:43,547][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:44:44,046][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:44:44,546][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:44:45,047][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:44:45,544][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:44:46,043][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:44:46,539][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:44:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:44:47,539][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:44:48,037][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:44:48,540][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:44:49,041][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:44:49,542][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:44:50,046][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:44:50,546][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:44:51,045][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:44:51,546][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:44:52,044][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:44:52,542][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:44:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:44:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:44:54,041][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:44:54,543][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:44:55,041][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:44:55,539][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:44:56,038][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:44:56,537][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:44:57,037][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:44:57,536][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:44:58,036][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:44:58,534][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:44:59,032][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:44:59,531][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:45:00,028][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:45:00,531][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:45:01,034][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:45:01,534][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:45:02,033][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:45:02,530][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:45:03,034][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:45:03,534][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:45:04,030][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:45:04,528][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:45:05,026][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:45:05,523][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:45:06,022][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:45:06,520][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:45:07,019][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:45:07,518][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:45:08,019][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:45:08,515][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:45:09,013][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:45:09,514][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:45:10,014][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:45:10,511][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:45:11,017][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:45:11,514][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:45:12,013][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:45:12,516][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:45:13,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9939 tokens. [2025-11-12 22:45:13,715][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 61.92%, ΔTime: 00:00:32 [2025-11-12 22:45:14,483][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:45:14,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:45:14,486][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:45:15,553][__main__][INFO] - Iteration 39 took 50s (28.10% Gen, 69.76% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 8m 26s. Estimated total time: 41h 44m 0s. Time estimates for 10 more iterations: 8m 20s, 100 more iterations: 1h 23m 28s, 500 more iterations: 6h 57m 20s. [2025-11-12 22:45:15,555][__main__][INFO] - Starting iteration 39. [2025-11-12 22:45:16,036][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2025-11-12 22:45:16,037][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:45:23,080][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given Bob's per-item values, he values hats the least and balls the most. 
Since I value hats and books more than balls, and considering the random assignment, I propose keeping all 10 hats. This maximizes my potential points from hats, given their higher value to me. I will not propose any books or balls since I expect Bob to take those due to their higher value to him, and allocating them would dilute my potential points from hats. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:45:25,695][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:45:30,798][__main__][INFO] - Number of regex retries in iteration 39: 2 [2025-11-12 22:45:30,799][__main__][INFO] - agents played in iteration 39 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:45:31,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:45:31,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:45:31,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:45:31,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:45:31,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. 
[2025-11-12 22:45:31,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 22:45:32,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:45:32,776][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:45:33,277][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:45:33,776][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:45:34,275][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:45:34,775][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:45:35,273][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:45:35,869][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:45:36,374][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:45:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:45:37,374][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:45:37,872][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:45:38,368][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:45:38,868][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:45:39,369][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:45:39,866][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:45:40,364][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:45:40,860][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:45:41,359][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:45:41,856][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 
22:45:42,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 22:45:42,851][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:45:43,348][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:45:43,846][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:45:44,344][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:45:44,841][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:45:45,345][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:45:45,840][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:45:46,338][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:45:46,838][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:45:47,337][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:45:47,838][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:45:48,337][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:45:48,835][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:45:49,337][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:45:49,836][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:45:50,336][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:45:50,834][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:45:51,335][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:45:51,839][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:45:52,340][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 
22:45:52,840][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 22:45:53,341][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:45:53,840][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:45:54,342][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:45:54,842][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:45:55,344][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:45:55,846][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:45:56,345][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:45:56,844][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:45:57,345][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:45:57,846][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:45:58,349][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:45:58,850][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:45:59,354][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:45:59,859][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:46:00,358][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:46:00,858][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:46:01,358][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:46:01,858][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:46:02,360][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:46:02,860][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 
22:46:03,359][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 22:46:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:46:04,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10086 tokens. [2025-11-12 22:46:05,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.01%, ΔTime: 00:00:32 [2025-11-12 22:46:05,869][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:46:05,871][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:46:05,873][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:46:06,808][__main__][INFO] - Iteration 40 took 50s (29.07% Gen, 69.08% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 42m 12s. Estimated total time: 42h 18m 38s. Time estimates for 10 more iterations: 8m 27s, 100 more iterations: 1h 24m 37s, 500 more iterations: 7h 3m 6s. [2025-11-12 22:46:06,810][__main__][INFO] - Starting iteration 40. [2025-11-12 22:46:07,295][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. 
[2025-11-12 22:46:07,295][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:46:08,545][mllm.models.large_language_model_local][WARNING] - Response Proposal: 5 hats, 5 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:46:20,929][__main__][INFO] - Number of regex retries in iteration 40: 1 [2025-11-12 22:46:20,930][__main__][INFO] - agents played in iteration 40 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:46:21,812][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:46:21,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:46:21,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:46:21,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:46:21,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:46:21,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:46:22,483][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:46:22,938][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:46:23,440][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:46:23,940][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:46:24,444][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:46:24,946][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:46:25,445][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:46:25,944][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:46:26,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:46:26,948][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:46:27,447][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:46:27,946][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:46:28,448][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:46:28,946][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:46:29,448][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:46:29,949][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:46:30,447][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:46:30,945][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:46:31,442][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:46:31,940][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:46:32,454][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:46:32,952][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:46:33,455][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:46:33,958][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:46:34,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:46:34,958][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:46:35,464][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:46:35,966][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:46:36,467][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:46:36,966][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:46:37,467][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:46:37,966][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:46:38,464][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:46:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:46:39,463][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:46:39,961][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:46:40,464][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:46:40,965][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:46:41,465][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:46:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:46:42,469][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:46:42,974][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:46:43,475][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:46:43,975][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:46:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:46:44,983][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:46:45,490][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:46:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:46:46,498][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:46:46,998][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:46:47,501][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:46:48,001][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:46:48,504][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:46:49,004][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:46:49,506][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:46:50,017][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:46:50,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:46:51,042][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:46:51,542][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:46:52,039][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:46:52,552][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:46:53,052][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:46:53,555][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:46:54,055][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:46:54,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10086 tokens. [2025-11-12 22:46:55,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 61.89%, ΔTime: 00:00:32 [2025-11-12 22:46:56,038][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:46:56,039][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:46:56,041][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:46:57,928][__main__][INFO] - Iteration 41 took 50s (26.93% Gen, 69.34% Train). Generation: 13s, Training: 35s. Estimated remaining time: 41h 34m 23s. Estimated total time: 42h 11m 39s. Time estimates for 10 more iterations: 8m 26s, 100 more iterations: 1h 24m 23s, 500 more iterations: 7h 1m 56s. [2025-11-12 22:46:57,930][__main__][INFO] - Starting iteration 41. [2025-11-12 22:46:58,431][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-12 22:46:58,431][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:47:13,186][__main__][INFO] - Number of regex retries in iteration 41: 0 [2025-11-12 22:47:13,186][__main__][INFO] - agents played in iteration 41 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:47:13,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:47:14,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:47:14,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:47:14,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:47:14,059][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:47:14,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:47:14,667][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:47:15,124][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:47:15,628][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:47:16,156][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:47:16,654][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:47:17,151][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:47:17,652][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:47:18,151][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:47:18,651][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:47:19,148][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:47:19,643][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:47:20,155][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:47:20,651][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:47:21,149][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:47:21,645][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:47:22,141][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:47:22,648][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:47:23,145][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:47:23,641][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:47:24,150][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:47:24,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:47:25,140][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:47:25,638][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:47:26,133][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:47:26,638][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:47:27,135][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:47:27,634][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:47:28,150][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:47:28,648][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:47:29,145][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:47:29,642][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:47:30,140][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:47:30,651][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:47:31,146][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:47:31,642][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:47:32,141][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:47:32,641][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:47:33,151][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:47:33,649][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:47:34,149][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:47:34,657][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:47:35,157][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:47:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:47:36,157][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:47:36,656][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:47:37,159][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:47:37,658][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:47:38,157][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:47:38,657][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:47:39,157][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:47:39,657][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:47:40,158][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:47:40,660][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:47:41,163][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:47:41,662][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:47:42,162][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:47:42,661][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:47:43,160][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:47:43,663][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:47:44,160][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:47:44,658][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:47:45,159][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:47:45,658][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:47:46,157][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:47:46,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9935 tokens. [2025-11-12 22:47:47,350][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.04%, ΔTime: 00:00:32 [2025-11-12 22:47:48,147][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:47:48,148][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:47:48,150][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:47:49,078][__main__][INFO] - Iteration 42 took 50s (29.13% Gen, 69.03% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 34m 15s. Estimated total time: 42h 12m 23s. Time estimates for 10 more iterations: 8m 26s, 100 more iterations: 1h 24m 24s, 500 more iterations: 7h 2m 3s. [2025-11-12 22:47:49,080][__main__][INFO] - Starting iteration 42. [2025-11-12 22:47:49,556][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-12 22:47:49,557][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:48:03,729][__main__][INFO] - Number of regex retries in iteration 42: 0 [2025-11-12 22:48:03,730][__main__][INFO] - agents played in iteration 42 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:48:04,575][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:48:04,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:48:04,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:48:04,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:48:04,655][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:48:04,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:48:05,316][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:48:05,771][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:48:06,282][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:48:06,781][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:48:07,282][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:48:07,780][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:48:08,280][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:48:08,790][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:48:09,288][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:48:09,786][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:48:10,285][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:48:10,782][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:48:11,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:48:11,779][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:48:12,277][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:48:12,781][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:48:13,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:48:13,778][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:48:14,274][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:48:14,768][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:48:15,269][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:48:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:48:16,262][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:48:16,760][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:48:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:48:17,756][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:48:18,253][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:48:18,749][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:48:19,263][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:48:19,759][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:48:20,256][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:48:20,755][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:48:21,252][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:48:21,759][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:48:22,254][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:48:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:48:23,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:48:23,763][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:48:24,265][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:48:24,764][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:48:25,262][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:48:25,767][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:48:26,265][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:48:26,763][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:48:27,266][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:48:27,764][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:48:28,264][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:48:28,765][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:48:29,262][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:48:29,775][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:48:30,276][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:48:30,775][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:48:31,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:48:31,773][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:48:32,278][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:48:32,776][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:48:33,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:48:33,776][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:48:34,274][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:48:34,772][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:48:35,269][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:48:35,766][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:48:36,278][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:48:36,776][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:48:37,276][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9886 tokens. [2025-11-12 22:48:37,956][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 61.95%, ΔTime: 00:00:32 [2025-11-12 22:48:38,752][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:48:38,753][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:48:38,755][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:48:39,719][__main__][INFO] - Iteration 43 took 50s (28.25% Gen, 69.82% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 9m 9s. Estimated total time: 41h 48m 7s. Time estimates for 10 more iterations: 8m 21s, 100 more iterations: 1h 23m 36s, 500 more iterations: 6h 58m 1s. [2025-11-12 22:48:39,721][__main__][INFO] - Starting iteration 43. [2025-11-12 22:48:40,228][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-12 22:48:40,229][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:48:54,862][__main__][INFO] - Number of regex retries in iteration 43: 0 [2025-11-12 22:48:54,863][__main__][INFO] - agents played in iteration 43 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:48:55,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:48:55,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:48:55,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:48:55,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:48:55,797][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:48:55,798][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:48:56,412][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:48:56,869][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:48:57,374][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:48:57,879][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:48:58,379][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:48:58,883][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:48:59,380][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:48:59,877][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:49:00,375][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:49:00,871][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:49:01,378][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:49:01,873][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:49:02,370][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:49:02,882][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:49:03,378][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:49:03,878][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:49:04,376][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:49:04,876][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:49:05,385][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:49:05,882][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:49:06,379][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:49:06,882][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:49:07,379][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:49:07,878][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:49:08,375][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:49:08,873][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:49:09,384][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:49:09,881][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:49:10,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:49:10,877][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:49:11,375][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:49:11,879][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:49:12,375][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:49:12,874][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:49:13,391][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:49:13,889][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:49:14,389][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:49:14,888][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:49:15,389][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:49:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:49:16,394][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:49:16,927][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:49:17,442][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:49:17,943][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:49:18,445][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:49:18,946][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:49:19,444][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:49:19,962][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:49:20,462][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:49:20,961][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:49:21,462][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:49:21,960][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:49:22,479][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:49:22,980][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:49:23,479][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:49:23,978][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:49:24,475][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:49:24,980][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:49:25,480][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:49:25,979][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:49:26,489][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:49:26,990][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:49:27,497][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:49:27,995][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:49:28,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9989 tokens. [2025-11-12 22:49:29,157][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.04%, ΔTime: 00:00:32 [2025-11-12 22:49:29,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:49:29,918][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:49:29,920][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:49:30,872][__main__][INFO] - Iteration 44 took 50s (28.89% Gen, 69.22% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 32m 25s. Estimated total time: 42h 12m 14s. Time estimates for 10 more iterations: 8m 26s, 100 more iterations: 1h 24m 24s, 500 more iterations: 7h 2m 2s. [2025-11-12 22:49:30,874][__main__][INFO] - Starting iteration 44. [2025-11-12 22:49:31,375][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-12 22:49:31,375][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:49:33,193][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:49:46,288][__main__][INFO] - Number of regex retries in iteration 44: 1 [2025-11-12 22:49:46,289][__main__][INFO] - agents played in iteration 44 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:49:47,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:49:47,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:49:47,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:49:47,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:49:47,210][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:49:47,211][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:49:47,822][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:49:48,276][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:49:48,795][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:49:49,302][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:49:49,806][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:49:50,307][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:49:50,804][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:49:51,308][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:49:51,805][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:49:52,302][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:49:52,801][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:49:53,299][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:49:53,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:49:54,296][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:49:54,793][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:49:55,297][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:49:55,793][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:49:56,292][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:49:56,790][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:49:57,288][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:49:57,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:49:58,291][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:49:58,790][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:49:59,298][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:49:59,796][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:50:00,296][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:50:00,789][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:50:01,286][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:50:01,790][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:50:02,286][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:50:02,783][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:50:03,279][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:50:03,777][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:50:04,276][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:50:04,774][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:50:05,275][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:50:05,789][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:50:06,290][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:50:06,796][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:50:07,297][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:50:07,801][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:50:08,306][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:50:08,808][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:50:09,306][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:50:09,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:50:10,303][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:50:10,799][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:50:11,298][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:50:11,797][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:50:12,300][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:50:12,798][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:50:13,297][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:50:13,797][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:50:14,296][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:50:14,801][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:50:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:50:15,799][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:50:16,298][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:50:16,794][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:50:17,292][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:50:17,790][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:50:18,290][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:50:18,798][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:50:19,299][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:50:19,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10060 tokens. [2025-11-12 22:50:20,439][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.02%, Current % of VRAM taken: 58.27%, Block Peak % of device VRAM: 61.88%, ΔTime: 00:00:32 [2025-11-12 22:50:21,225][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:50:21,227][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:50:21,229][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:50:22,162][__main__][INFO] - Iteration 45 took 50s (29.36% Gen, 68.80% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 38m 43s. Estimated total time: 42h 19m 23s. Time estimates for 10 more iterations: 8m 27s, 100 more iterations: 1h 24m 38s, 500 more iterations: 7h 3m 13s. [2025-11-12 22:50:22,164][__main__][INFO] - Starting iteration 45. [2025-11-12 22:50:22,645][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-12 22:50:22,646][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:50:24,188][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:50:35,288][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:50:37,713][__main__][INFO] - Number of regex retries in iteration 45: 2 [2025-11-12 22:50:37,713][__main__][INFO] - agents played in iteration 45 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:50:38,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:50:38,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:50:38,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:50:38,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:50:38,616][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:50:38,617][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:50:39,234][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:50:39,690][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:50:40,208][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:50:40,711][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:50:41,212][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:50:41,711][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:50:42,208][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:50:42,708][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:50:43,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:50:43,700][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:50:44,201][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:50:44,697][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:50:45,194][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:50:45,702][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:50:46,199][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:50:46,695][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:50:47,191][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:50:47,691][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:50:48,202][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:50:48,702][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:50:49,198][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:50:49,728][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:50:50,229][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:50:50,729][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:50:51,234][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:50:51,731][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:50:52,228][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:50:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:50:53,221][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:50:53,719][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:50:54,216][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:50:54,714][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:50:55,212][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:50:55,712][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:50:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:50:56,715][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:50:57,215][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:50:57,718][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:50:58,217][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:50:58,716][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:50:59,217][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:50:59,718][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:51:00,221][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:51:00,722][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:51:01,221][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:51:01,725][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:51:02,223][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:51:02,721][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:51:03,225][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:51:03,726][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:51:04,226][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:51:04,726][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:51:05,224][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:51:05,727][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:51:06,224][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:51:06,729][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:51:07,227][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:51:07,725][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:51:08,226][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:51:08,723][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:51:09,219][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:51:09,717][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:51:10,214][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:51:10,710][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:51:11,215][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10006 tokens. [2025-11-12 22:51:11,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.04%, Current % of VRAM taken: 58.29%, Block Peak % of device VRAM: 62.00%, ΔTime: 00:00:32 [2025-11-12 22:51:12,658][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:51:12,660][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:51:12,661][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:51:13,669][__main__][INFO] - Iteration 46 took 51s (29.53% Gen, 68.49% Train). Generation: 15s, Training: 34s. Estimated remaining time: 41h 49m 40s. Estimated total time: 42h 31m 12s. Time estimates for 10 more iterations: 8m 30s, 100 more iterations: 1h 25m 2s, 500 more iterations: 7h 5m 12s. [2025-11-12 22:51:13,671][__main__][INFO] - Starting iteration 46. [2025-11-12 22:51:14,142][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-12 22:51:14,143][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:51:28,460][__main__][INFO] - Number of regex retries in iteration 46: 0 [2025-11-12 22:51:28,461][__main__][INFO] - agents played in iteration 46 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:51:29,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:51:29,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:51:29,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:51:29,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:51:29,379][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:51:29,379][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:51:29,988][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:51:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:51:30,959][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:51:31,466][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:51:31,972][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:51:32,472][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:51:32,973][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:51:33,488][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:51:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:51:34,486][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:51:34,989][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:51:35,491][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:51:36,017][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:51:36,517][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:51:37,017][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:51:37,520][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:51:38,022][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:51:38,519][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:51:39,014][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:51:39,511][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:51:40,012][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:51:40,509][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:51:41,005][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:51:41,505][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:51:42,002][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:51:42,500][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:51:42,997][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:51:43,496][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:51:44,008][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:51:44,505][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:51:45,004][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:51:45,502][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:51:46,003][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:51:46,508][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:51:47,005][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:51:47,503][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:51:48,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:51:48,504][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:51:49,005][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:51:49,506][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:51:50,004][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:51:50,519][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:51:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:51:51,522][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:51:52,021][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:51:52,521][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:51:53,021][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:51:53,520][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:51:54,019][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:51:54,522][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:51:55,023][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:51:55,523][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:51:56,022][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:51:56,519][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:51:57,023][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:51:57,522][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:51:58,022][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:51:58,520][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:51:59,023][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:51:59,524][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:52:00,022][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:52:00,521][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:52:01,024][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:52:01,523][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:52:02,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9997 tokens. [2025-11-12 22:52:02,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 61.98%, ΔTime: 00:00:32 [2025-11-12 22:52:03,434][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:52:03,436][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:52:03,438][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:52:04,384][__main__][INFO] - Iteration 47 took 50s (28.50% Gen, 69.62% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 9m 43s. Estimated total time: 41h 52m 6s. Time estimates for 10 more iterations: 8m 22s, 100 more iterations: 1h 23m 44s, 500 more iterations: 6h 58m 41s. [2025-11-12 22:52:04,386][__main__][INFO] - Starting iteration 47. [2025-11-12 22:52:04,862][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-12 22:52:04,862][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:52:10,294][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:52:19,467][__main__][INFO] - Number of regex retries in iteration 47: 1 [2025-11-12 22:52:19,468][__main__][INFO] - agents played in iteration 47 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:52:20,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:52:20,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:52:20,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:52:20,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:52:20,318][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:52:20,319][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:52:20,931][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:52:21,386][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:52:21,890][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:52:22,391][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:52:22,894][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:52:23,395][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:52:23,893][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:52:24,390][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:52:24,889][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:52:25,385][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:52:25,880][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:52:26,388][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:52:26,885][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:52:27,378][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:52:27,873][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:52:28,368][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:52:28,866][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:52:29,362][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:52:29,859][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:52:30,356][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:52:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:52:31,349][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:52:31,846][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:52:32,342][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:52:32,842][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:52:33,339][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:52:33,838][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:52:34,337][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:52:34,836][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:52:35,336][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:52:35,835][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:52:36,334][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:52:36,831][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:52:37,328][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:52:37,826][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:52:38,324][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:52:38,823][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:52:39,327][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:52:39,827][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:52:40,328][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:52:40,827][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:52:41,323][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:52:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:52:42,331][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:52:42,834][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:52:43,339][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:52:43,840][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:52:44,338][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:52:44,845][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:52:45,345][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:52:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:52:46,341][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:52:46,837][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:52:47,336][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:52:47,832][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:52:48,334][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:52:48,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:52:49,333][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:52:49,837][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:52:50,340][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:52:50,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:52:51,340][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:52:51,840][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:52:52,343][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:52:52,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9959 tokens. [2025-11-12 22:52:53,499][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.04%, ΔTime: 00:00:32 [2025-11-12 22:52:54,254][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:52:54,256][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:52:54,258][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:52:55,200][__main__][INFO] - Iteration 48 took 50s (29.01% Gen, 69.11% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 13m 40s. Estimated total time: 41h 56m 54s. Time estimates for 10 more iterations: 8m 23s, 100 more iterations: 1h 23m 53s, 500 more iterations: 6h 59m 29s. [2025-11-12 22:52:55,202][__main__][INFO] - Starting iteration 48. [2025-11-12 22:52:55,785][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-12 22:52:55,786][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:52:57,413][mllm.models.large_language_model_local][WARNING] - Response Proposal: 5 hats, 5 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:53:10,652][__main__][INFO] - Number of regex retries in iteration 48: 1 [2025-11-12 22:53:10,653][__main__][INFO] - agents played in iteration 48 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:53:11,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:53:11,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:53:11,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:53:11,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:53:11,498][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:53:11,498][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:53:12,139][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:53:12,596][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:53:13,103][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:53:13,607][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:53:14,107][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:53:14,615][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:53:15,117][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:53:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:53:16,126][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:53:16,630][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:53:17,134][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:53:17,642][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:53:18,141][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:53:18,644][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:53:19,146][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:53:19,646][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:53:20,145][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:53:20,644][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:53:21,142][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:53:21,639][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:53:22,136][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:53:22,635][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:53:23,131][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:53:23,627][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:53:24,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:53:24,621][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:53:25,119][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:53:25,617][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:53:26,113][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:53:26,625][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:53:27,122][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:53:27,619][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:53:28,118][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:53:28,616][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:53:29,124][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:53:29,620][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:53:30,120][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:53:30,622][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:53:31,119][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:53:31,622][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:53:32,122][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:53:32,623][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:53:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:53:33,638][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:53:34,142][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:53:34,649][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:53:35,149][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:53:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:53:36,149][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:53:36,650][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:53:37,151][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:53:37,649][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:53:38,151][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:53:38,652][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:53:39,154][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:53:39,656][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:53:40,156][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:53:40,657][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:53:41,163][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:53:41,662][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:53:42,160][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:53:42,663][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:53:43,164][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:53:43,667][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:53:44,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10134 tokens. [2025-11-12 22:53:44,841][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 61.99%, ΔTime: 00:00:32 [2025-11-12 22:53:45,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:53:45,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:53:45,614][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:53:46,579][__main__][INFO] - Iteration 49 took 50s (29.27% Gen, 68.83% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 35m 38s. Estimated total time: 42h 19m 43s. Time estimates for 10 more iterations: 8m 27s, 100 more iterations: 1h 24m 39s, 500 more iterations: 7h 3m 17s. [2025-11-12 22:53:46,581][__main__][INFO] - Starting iteration 49. [2025-11-12 22:53:47,086][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-12 22:53:47,087][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:54:02,096][__main__][INFO] - Number of regex retries in iteration 49: 0 [2025-11-12 22:54:02,096][__main__][INFO] - agents played in iteration 49 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:54:02,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:54:02,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:54:02,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:54:02,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:54:02,960][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:54:02,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:54:03,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:54:04,035][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:54:04,544][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:54:05,049][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:54:05,550][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:54:06,049][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:54:06,546][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:54:07,043][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:54:07,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:54:08,048][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:54:08,546][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:54:09,046][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:54:09,543][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:54:10,046][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:54:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:54:11,043][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:54:11,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:54:12,040][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:54:12,540][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:54:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:54:13,535][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:54:14,041][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:54:14,539][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:54:15,037][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:54:15,533][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:54:16,030][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:54:16,536][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:54:17,036][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:54:17,535][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:54:18,046][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:54:18,549][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:54:19,050][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:54:19,547][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:54:20,042][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:54:20,545][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:54:21,041][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:54:21,539][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:54:22,043][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:54:22,541][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:54:23,042][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:54:23,538][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:54:24,037][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:54:24,539][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:54:25,037][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:54:25,538][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:54:26,045][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:54:26,544][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:54:27,048][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:54:27,543][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:54:28,047][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:54:28,548][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:54:29,049][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:54:29,553][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:54:30,052][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:54:30,550][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:54:31,053][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:54:31,553][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:54:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:54:32,551][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:54:33,051][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:54:33,558][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:54:34,061][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:54:34,560][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:54:35,059][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:54:35,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10091 tokens. [2025-11-12 22:54:36,248][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.03%, Current % of VRAM taken: 58.28%, Block Peak % of device VRAM: 61.96%, ΔTime: 00:00:32 [2025-11-12 22:54:37,018][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:54:37,020][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:54:37,021][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:54:37,976][__main__][INFO] - Iteration 50 took 50s (29.49% Gen, 68.63% Train). Generation: 15s, Training: 34s. Estimated remaining time: 41h 39m 32s. Estimated total time: 42h 24m 28s. Time estimates for 10 more iterations: 8m 28s, 100 more iterations: 1h 24m 48s, 500 more iterations: 7h 4m 4s. [2025-11-12 22:54:37,978][__main__][INFO] - Starting iteration 50. [2025-11-12 22:54:38,477][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. 
[2025-11-12 22:54:38,478][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:54:53,283][__main__][INFO] - Number of regex retries in iteration 50: 0 [2025-11-12 22:54:53,284][__main__][INFO] - agents played in iteration 50 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:54:54,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:54:54,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:54:54,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:54:54,195][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:54:54,195][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:54:54,196][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:54:54,821][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:54:55,279][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:54:55,786][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:54:56,287][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:54:56,790][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:54:57,296][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:54:57,795][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:54:58,293][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:54:58,796][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:54:59,294][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:54:59,792][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:55:00,291][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:55:00,787][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:55:01,286][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:55:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:55:02,282][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:55:02,780][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:55:03,278][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:55:03,778][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:55:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:55:04,772][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:55:05,275][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:55:05,776][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:55:06,274][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:55:06,774][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:55:07,270][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:55:07,768][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:55:08,265][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:55:08,762][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:55:09,261][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:55:09,759][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:55:10,261][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:55:10,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:55:11,255][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:55:11,764][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:55:12,264][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:55:12,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:55:13,276][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:55:13,779][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:55:14,284][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:55:14,784][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:55:15,281][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:55:15,811][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:55:16,318][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:55:16,829][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:55:17,337][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:55:17,838][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:55:18,344][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:55:18,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:55:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:55:19,849][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:55:20,353][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:55:20,860][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:55:21,360][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:55:21,863][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:55:22,363][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:55:22,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:55:23,367][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:55:23,864][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:55:24,364][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:55:24,868][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:55:25,367][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:55:25,864][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:55:26,363][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:55:26,864][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10156 tokens. [2025-11-12 22:55:27,516][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:32 [2025-11-12 22:55:28,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:55:28,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:55:28,271][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:55:30,149][__main__][INFO] - Iteration 51 took 51s (28.65% Gen, 67.71% Train). Generation: 14s, Training: 34s. Estimated remaining time: 42h 17m 48s. Estimated total time: 43h 3m 37s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 7s, 500 more iterations: 7h 10m 36s. [2025-11-12 22:55:30,151][__main__][INFO] - Starting iteration 51. [2025-11-12 22:55:30,623][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-12 22:55:30,623][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:55:45,904][__main__][INFO] - Number of regex retries in iteration 51: 0 [2025-11-12 22:55:45,905][__main__][INFO] - agents played in iteration 51 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:55:46,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:55:46,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:55:46,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:55:46,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:55:46,820][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:55:46,821][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:55:47,430][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:55:47,883][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:55:48,388][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:55:48,888][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:55:49,392][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:55:49,890][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:55:50,386][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:55:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:55:51,379][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:55:51,878][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:55:52,381][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:55:52,878][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:55:53,378][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:55:53,879][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:55:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:55:54,881][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:55:55,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:55:55,878][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:55:56,381][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:55:56,881][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:55:57,380][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:55:57,878][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:55:58,375][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:55:58,877][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:55:59,377][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:55:59,876][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:56:00,377][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:56:00,875][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:56:01,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:56:01,871][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:56:02,368][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:56:02,866][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:56:03,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:56:03,866][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:56:04,402][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:56:04,901][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:56:05,404][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:56:05,906][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:56:06,407][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:56:06,907][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:56:07,405][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:56:07,905][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:56:08,414][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:56:08,922][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:56:09,424][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:56:09,926][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:56:10,425][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:56:10,932][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:56:11,433][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:56:11,931][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:56:12,443][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:56:12,941][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:56:13,444][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:56:13,946][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:56:14,442][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:56:14,943][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:56:15,443][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:56:15,943][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:56:16,442][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:56:16,941][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:56:17,466][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:56:17,966][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:56:18,469][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:56:18,969][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:56:19,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10156 tokens. [2025-11-12 22:56:20,112][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.08%, ΔTime: 00:00:32 [2025-11-12 22:56:20,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:56:20,884][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:56:20,886][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:56:21,843][__main__][INFO] - Iteration 52 took 51s (29.83% Gen, 68.30% Train). Generation: 15s, Training: 34s. Estimated remaining time: 41h 54m 21s. Estimated total time: 42h 41m 1s. Time estimates for 10 more iterations: 8m 32s, 100 more iterations: 1h 25m 22s, 500 more iterations: 7h 6m 50s. [2025-11-12 22:56:21,845][__main__][INFO] - Starting iteration 52. [2025-11-12 22:56:22,323][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-12 22:56:22,324][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:56:24,799][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:56:36,672][__main__][INFO] - Number of regex retries in iteration 52: 1 [2025-11-12 22:56:36,673][__main__][INFO] - agents played in iteration 52 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:56:37,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:56:37,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:56:37,532][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:56:37,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:56:37,555][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:56:37,556][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:56:38,179][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:56:38,635][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:56:39,141][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:56:39,666][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:56:40,168][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:56:40,673][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:56:41,181][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:56:41,680][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:56:42,180][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:56:42,675][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:56:43,172][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:56:43,676][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:56:44,173][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:56:44,670][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:56:45,169][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:56:45,666][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:56:46,165][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:56:46,663][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:56:47,161][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:56:47,672][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:56:48,170][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:56:48,667][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:56:49,164][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:56:49,661][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:56:50,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:56:50,665][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:56:51,163][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:56:51,660][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:56:52,160][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:56:52,658][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:56:53,175][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:56:53,674][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:56:54,176][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:56:54,674][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:56:55,172][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:56:55,679][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:56:56,179][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:56:56,675][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:56:57,174][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:56:57,671][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:56:58,171][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:56:58,670][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:56:59,172][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:56:59,681][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:57:00,183][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:57:00,685][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:57:01,183][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:57:01,683][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:57:02,186][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:57:02,686][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:57:03,184][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:57:03,689][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:57:04,187][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:57:04,685][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:57:05,183][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:57:05,685][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:57:06,187][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:57:06,686][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:57:07,185][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:57:07,684][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:57:08,185][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:57:08,688][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:57:09,187][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:57:09,687][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:57:10,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10027 tokens. [2025-11-12 22:57:10,846][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 61.94%, ΔTime: 00:00:32 [2025-11-12 22:57:11,641][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:57:11,643][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:57:11,645][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:57:12,600][__main__][INFO] - Iteration 53 took 50s (28.54% Gen, 69.56% Train). Generation: 14s, Training: 34s. Estimated remaining time: 41h 6m 24s. Estimated total time: 41h 53m 55s. Time estimates for 10 more iterations: 8m 22s, 100 more iterations: 1h 23m 47s, 500 more iterations: 6h 58m 59s. [2025-11-12 22:57:12,602][__main__][INFO] - Starting iteration 53. [2025-11-12 22:57:13,082][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-12 22:57:13,084][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:57:14,877][mllm.models.large_language_model_local][WARNING] - Response Propposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:57:27,300][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 1 did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:57:28,193][__main__][INFO] - Number of regex retries in iteration 53: 2 [2025-11-12 22:57:28,193][__main__][INFO] - agents played in iteration 53 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:57:28,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:57:29,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:57:29,046][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:57:29,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:57:29,069][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:57:29,070][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:57:29,704][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:57:30,166][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:57:30,674][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:57:31,187][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:57:31,689][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:57:32,199][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:57:32,702][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:57:33,208][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:57:33,708][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:57:34,208][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:57:34,708][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:57:35,207][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:57:35,707][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:57:36,205][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:57:36,708][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:57:37,208][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:57:37,709][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:57:38,207][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:57:38,704][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:57:39,204][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:57:39,701][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:57:40,198][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:57:40,700][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:57:41,198][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:57:41,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:57:42,198][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:57:42,697][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:57:43,197][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:57:43,696][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:57:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:57:44,698][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:57:45,200][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:57:45,704][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:57:46,203][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:57:46,706][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:57:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:57:47,711][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:57:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:57:48,718][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:57:49,220][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:57:49,724][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:57:50,224][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:57:50,724][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:57:51,227][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:57:51,726][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:57:52,228][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:57:52,729][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:57:53,229][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:57:53,733][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:57:54,237][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:57:54,738][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:57:55,242][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:57:55,743][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:57:56,243][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:57:56,749][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:57:57,247][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:57:57,748][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:57:58,247][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:57:58,747][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:57:59,248][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:57:59,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:58:00,247][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:58:00,747][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:58:01,246][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:58:01,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10186 tokens. [2025-11-12 22:58:02,447][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.09%, ΔTime: 00:00:32 [2025-11-12 22:58:03,244][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:58:03,246][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:58:03,248][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:58:04,192][__main__][INFO] - Iteration 54 took 51s (29.56% Gen, 68.59% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 47m 9s. Estimated total time: 42h 35m 31s. Time estimates for 10 more iterations: 8m 31s, 100 more iterations: 1h 25m 11s, 500 more iterations: 7h 5m 55s. [2025-11-12 22:58:04,194][__main__][INFO] - Starting iteration 54. [2025-11-12 22:58:04,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-12 22:58:04,683][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:58:06,769][mllm.models.large_language_model_local][WARNING] - Response Proposal: 5 hats, 10 books, 5 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 22:58:19,840][__main__][INFO] - Number of regex retries in iteration 54: 1 [2025-11-12 22:58:19,841][__main__][INFO] - agents played in iteration 54 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:58:20,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:58:20,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:58:20,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:58:20,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:58:20,692][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:58:20,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:58:21,302][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:58:21,760][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:58:22,271][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:58:22,776][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:58:23,278][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:58:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:58:24,291][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:58:24,793][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:58:25,301][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:58:25,799][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:58:26,295][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:58:26,790][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:58:27,285][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:58:27,791][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:58:28,285][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:58:28,782][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:58:29,279][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:58:29,778][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:58:30,277][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:58:30,780][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:58:31,279][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:58:31,797][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:58:32,297][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:58:32,797][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:58:33,297][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:58:33,791][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:58:34,294][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:58:34,789][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:58:35,286][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:58:35,799][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:58:36,302][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:58:36,803][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:58:37,302][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:58:37,803][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:58:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:58:38,809][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:58:39,310][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:58:39,811][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:58:40,311][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:58:40,815][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:58:41,321][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:58:41,821][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:58:42,326][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:58:42,827][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:58:43,335][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:58:43,839][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:58:44,340][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:58:44,843][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:58:45,344][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:58:45,843][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:58:46,343][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:58:46,841][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:58:47,340][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:58:47,840][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:58:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:58:48,839][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:58:49,340][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:58:49,844][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:58:50,343][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:58:50,843][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:58:51,346][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:58:51,848][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:58:52,348][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:58:52,850][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:58:53,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10085 tokens. [2025-11-12 22:58:54,066][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.06%, ΔTime: 00:00:32 [2025-11-12 22:58:54,831][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:58:54,833][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:58:54,835][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:58:55,759][__main__][INFO] - Iteration 55 took 51s (29.68% Gen, 68.51% Train). Generation: 15s, Training: 34s. Estimated remaining time: 41h 44m 37s. Estimated total time: 42h 33m 51s. Time estimates for 10 more iterations: 8m 30s, 100 more iterations: 1h 25m 7s, 500 more iterations: 7h 5m 38s. [2025-11-12 22:58:55,761][__main__][INFO] - Starting iteration 55. [2025-11-12 22:58:56,246][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-12 22:58:56,247][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 22:59:11,348][__main__][INFO] - Number of regex retries in iteration 55: 0 [2025-11-12 22:59:11,349][__main__][INFO] - agents played in iteration 55 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 22:59:12,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:59:12,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:59:12,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:59:12,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 22:59:12,276][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 22:59:12,277][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 22:59:12,882][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 22:59:13,340][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 22:59:13,851][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 22:59:14,354][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 22:59:14,856][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 22:59:15,362][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 22:59:15,862][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 22:59:16,376][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 22:59:16,877][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 22:59:17,375][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 22:59:17,877][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 22:59:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 22:59:18,890][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 22:59:19,387][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 22:59:19,884][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 22:59:20,401][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 22:59:20,900][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 22:59:21,399][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 22:59:21,898][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 22:59:22,397][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 22:59:22,907][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
22:59:23,406][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 22:59:23,903][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 22:59:24,398][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 22:59:24,897][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 22:59:25,398][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 22:59:25,896][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 22:59:26,396][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 22:59:26,907][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 22:59:27,408][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 22:59:27,909][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 22:59:28,409][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 22:59:28,911][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 22:59:29,423][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 22:59:29,927][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 22:59:30,436][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 22:59:30,942][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 22:59:31,445][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 22:59:31,949][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 22:59:32,458][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 22:59:32,962][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 22:59:33,464][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
22:59:33,968][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 22:59:34,465][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 22:59:34,965][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 22:59:35,463][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 22:59:35,973][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 22:59:36,471][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 22:59:36,967][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 22:59:37,480][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 22:59:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 22:59:38,478][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 22:59:38,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 22:59:39,477][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 22:59:39,991][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 22:59:40,489][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 22:59:40,987][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 22:59:41,488][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 22:59:41,995][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 22:59:42,500][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 22:59:43,000][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 22:59:43,500][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 22:59:44,002][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
22:59:44,503][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 22:59:45,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10251 tokens. [2025-11-12 22:59:45,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:32 [2025-11-12 22:59:46,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 22:59:46,426][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 22:59:46,428][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 22:59:47,341][__main__][INFO] - Iteration 56 took 51s (29.56% Gen, 68.65% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 44m 40s. Estimated total time: 42h 34m 46s. Time estimates for 10 more iterations: 8m 30s, 100 more iterations: 1h 25m 9s, 500 more iterations: 7h 5m 47s. [2025-11-12 22:59:47,344][__main__][INFO] - Starting iteration 56. [2025-11-12 22:59:47,857][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-12 22:59:47,857][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:00:03,460][__main__][INFO] - Number of regex retries in iteration 56: 0 [2025-11-12 23:00:03,461][__main__][INFO] - agents played in iteration 56 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:00:04,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:00:04,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:00:04,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:00:04,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:00:04,365][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:00:04,366][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:00:05,000][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:00:05,457][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:00:05,961][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:00:06,468][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:00:06,971][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:00:07,495][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:00:07,998][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:00:08,500][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:00:09,000][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:00:09,502][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:00:10,013][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:00:10,514][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:00:11,017][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:00:11,521][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:00:12,043][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:00:12,545][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:00:13,046][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:00:13,547][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:00:14,049][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:00:14,551][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:00:15,050][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:00:15,557][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:00:16,055][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:00:16,553][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:00:17,052][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:00:17,557][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:00:18,065][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:00:18,565][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:00:19,067][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:00:19,570][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:00:20,070][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:00:20,587][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:00:21,087][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:00:21,584][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:00:22,090][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:00:22,590][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:00:23,104][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:00:23,607][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:00:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:00:24,618][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:00:25,120][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:00:25,635][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:00:26,141][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:00:26,646][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:00:27,150][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:00:27,654][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:00:28,154][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:00:28,659][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:00:29,161][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:00:29,662][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:00:30,162][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:00:30,663][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:00:31,169][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:00:31,669][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:00:32,170][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:00:32,674][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:00:33,173][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:00:33,679][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:00:34,179][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:00:34,680][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:00:35,194][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:00:35,693][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:00:36,192][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:00:36,689][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:00:37,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10279 tokens. [2025-11-12 23:00:37,839][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:32 [2025-11-12 23:00:38,580][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:00:38,582][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:00:38,583][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:00:39,520][__main__][INFO] - Iteration 57 took 51s (30.20% Gen, 67.98% Train). Generation: 15s, Training: 35s. Estimated remaining time: 42h 12m 11s. Estimated total time: 43h 3m 9s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 6s, 500 more iterations: 7h 10m 31s. [2025-11-12 23:00:39,522][__main__][INFO] - Starting iteration 57. [2025-11-12 23:00:40,015][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-12 23:00:40,015][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:00:55,288][__main__][INFO] - Number of regex retries in iteration 57: 0 [2025-11-12 23:00:55,289][__main__][INFO] - agents played in iteration 57 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:00:56,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:00:56,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:00:56,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:00:56,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:00:56,349][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:00:56,350][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:00:56,961][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:00:57,424][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:00:57,930][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:00:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:00:58,938][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:00:59,438][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:00:59,940][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:01:00,439][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:01:00,937][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:01:01,445][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:01:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:01:02,447][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:01:02,950][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:01:03,450][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:01:03,950][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:01:04,447][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:01:04,945][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:01:05,450][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:01:05,948][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:01:06,447][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:01:06,945][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:01:07,448][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:01:07,949][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:01:08,450][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:01:08,950][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:01:09,474][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:01:09,982][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:01:10,482][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:01:10,997][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:01:11,498][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:01:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:01:12,525][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:01:13,024][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:01:13,529][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:01:14,032][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:01:14,538][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:01:15,039][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:01:15,544][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:01:16,048][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:01:16,548][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:01:17,050][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:01:17,557][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:01:18,061][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:01:18,560][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:01:19,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:01:19,561][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:01:20,071][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:01:20,571][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:01:21,072][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:01:21,574][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:01:22,077][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:01:22,579][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:01:23,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:01:23,582][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:01:24,092][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:01:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:01:25,102][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:01:25,611][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:01:26,111][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:01:26,625][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:01:27,126][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:01:27,628][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:01:28,136][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:01:28,639][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:01:29,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10201 tokens. [2025-11-12 23:01:29,786][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.09%, ΔTime: 00:00:32 [2025-11-12 23:01:30,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:01:30,605][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:01:30,607][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:01:31,535][__main__][INFO] - Iteration 58 took 51s (29.64% Gen, 68.55% Train). Generation: 15s, Training: 35s. Estimated remaining time: 42h 4m 10s. Estimated total time: 42h 56m 0s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 52s, 500 more iterations: 7h 9m 20s. [2025-11-12 23:01:31,537][__main__][INFO] - Starting iteration 58. [2025-11-12 23:01:32,019][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-12 23:01:32,020][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:01:46,686][__main__][INFO] - Number of regex retries in iteration 58: 0 [2025-11-12 23:01:46,687][__main__][INFO] - agents played in iteration 58 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:01:47,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:01:47,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:01:47,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:01:47,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:01:47,635][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:01:47,636][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:01:48,240][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:01:48,702][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:01:49,205][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:01:49,705][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:01:50,223][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:01:50,726][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:01:51,225][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:01:51,724][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:01:52,227][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:01:52,732][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:01:53,232][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:01:53,735][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:01:54,239][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:01:54,742][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:01:55,244][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:01:55,750][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:01:56,255][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:01:56,760][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:01:57,259][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:01:57,762][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:01:58,266][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:01:58,770][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:01:59,273][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:01:59,777][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:02:00,278][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:02:00,782][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:02:01,285][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:02:01,785][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:02:02,283][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:02:02,787][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:02:03,289][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:02:03,786][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:02:04,287][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:02:04,787][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:02:05,286][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:02:05,795][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:02:06,299][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:02:06,800][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:02:07,318][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:02:07,818][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:02:08,330][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:02:08,830][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:02:09,331][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:02:09,846][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:02:10,345][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:02:10,847][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:02:11,347][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:02:11,847][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:02:12,354][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:02:12,855][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:02:13,355][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:02:13,853][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:02:14,355][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:02:14,858][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:02:15,359][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:02:15,859][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:02:16,361][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:02:16,862][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:02:17,359][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:02:17,858][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:02:18,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:02:18,860][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:02:19,360][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:02:19,857][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:02:20,361][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10191 tokens. [2025-11-12 23:02:20,996][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.01%, ΔTime: 00:00:32 [2025-11-12 23:02:21,749][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:02:21,751][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:02:21,753][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:02:22,649][__main__][INFO] - Iteration 59 took 50s (28.97% Gen, 69.26% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 18m 50s. Estimated total time: 42h 11m 31s. Time estimates for 10 more iterations: 8m 26s, 100 more iterations: 1h 24m 23s, 500 more iterations: 7h 1m 55s. [2025-11-12 23:02:22,651][__main__][INFO] - Starting iteration 59. [2025-11-12 23:02:23,144][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-12 23:02:23,145][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:02:38,684][__main__][INFO] - Number of regex retries in iteration 59: 0 [2025-11-12 23:02:38,685][__main__][INFO] - agents played in iteration 59 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:02:39,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:02:39,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:02:39,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:02:39,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:02:39,578][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:02:39,579][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:02:40,196][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:02:40,657][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:02:41,161][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:02:41,666][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:02:42,168][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:02:42,675][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:02:43,175][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:02:43,680][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:02:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:02:44,681][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:02:45,182][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:02:45,684][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:02:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:02:46,689][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:02:47,186][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:02:47,687][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:02:48,191][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:02:48,691][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:02:49,195][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:02:49,696][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:02:50,195][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:02:50,719][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:02:51,221][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:02:51,721][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:02:52,226][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:02:52,727][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:02:53,237][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:02:53,738][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:02:54,238][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:02:54,748][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:02:55,248][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:02:55,751][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:02:56,251][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:02:56,752][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:02:57,258][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:02:57,762][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:02:58,264][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:02:58,769][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:02:59,272][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:02:59,776][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:03:00,278][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:03:00,779][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:03:01,283][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:03:01,782][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:03:02,283][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:03:02,784][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:03:03,281][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:03:03,779][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:03:04,277][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:03:04,778][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:03:05,280][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:03:05,783][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:03:06,283][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:03:06,783][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:03:07,281][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:03:07,781][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:03:08,279][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:03:08,777][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:03:09,279][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:03:09,778][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:03:10,277][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:03:10,777][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:03:11,277][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:03:11,776][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:03:12,277][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10209 tokens. [2025-11-12 23:03:12,934][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 61.96%, ΔTime: 00:00:32 [2025-11-12 23:03:13,698][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:03:13,700][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:03:13,702][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:03:14,635][__main__][INFO] - Iteration 60 took 51s (30.18% Gen, 68.01% Train). Generation: 15s, Training: 35s. Estimated remaining time: 42h 1m 0s. Estimated total time: 42h 54m 33s. Time estimates for 10 more iterations: 8m 34s, 100 more iterations: 1h 25m 49s, 500 more iterations: 7h 9m 5s. [2025-11-12 23:03:14,637][__main__][INFO] - Starting iteration 60. [2025-11-12 23:03:15,108][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. 
[2025-11-12 23:03:15,109][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:03:17,189][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:03:30,756][__main__][INFO] - Number of regex retries in iteration 60: 1 [2025-11-12 23:03:30,756][__main__][INFO] - agents played in iteration 60 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:03:31,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:03:31,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:03:31,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:03:31,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:03:31,690][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:03:31,691][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:03:32,306][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:03:32,765][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:03:33,277][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:03:33,786][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:03:34,288][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:03:34,786][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:03:35,285][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:03:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:03:36,355][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:03:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:03:37,357][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:03:37,859][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:03:38,367][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:03:38,875][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:03:39,379][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:03:39,885][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:03:40,389][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:03:40,894][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:03:41,392][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:03:41,894][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:03:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:03:42,897][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:03:43,401][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:03:43,902][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:03:44,405][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:03:44,910][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:03:45,406][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:03:45,901][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:03:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:03:46,903][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:03:47,401][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:03:47,902][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:03:48,402][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:03:48,903][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:03:49,403][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:03:49,905][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:03:50,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:03:50,910][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:03:51,408][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:03:51,910][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:03:52,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:03:52,923][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:03:53,425][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:03:53,927][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:03:54,427][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:03:54,928][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:03:55,448][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:03:55,949][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:03:56,448][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:03:56,951][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:03:57,452][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:03:57,957][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:03:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:03:58,961][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:03:59,468][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:03:59,969][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:04:00,468][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:04:00,966][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:04:01,465][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:04:01,967][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:04:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:04:02,969][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:04:03,470][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:04:03,971][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:04:04,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10267 tokens. [2025-11-12 23:04:05,145][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:32 [2025-11-12 23:04:05,903][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:04:05,904][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:04:05,906][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:04:07,753][__main__][INFO] - Iteration 61 took 52s (29.72% Gen, 66.77% Train). Generation: 15s, Training: 35s. Estimated remaining time: 42h 57m 49s. Estimated total time: 43h 52m 15s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 44s, 500 more iterations: 7h 18m 42s. [2025-11-12 23:04:07,755][__main__][INFO] - Starting iteration 61. [2025-11-12 23:04:08,282][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-12 23:04:08,283][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:04:24,438][__main__][INFO] - Number of regex retries in iteration 61: 0 [2025-11-12 23:04:24,438][__main__][INFO] - agents played in iteration 61 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:04:25,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:04:25,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:04:25,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:04:25,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:04:25,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:04:25,286][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:04:25,919][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:04:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:04:26,884][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:04:27,385][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:04:27,886][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:04:28,385][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:04:28,885][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:04:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:04:29,881][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:04:30,384][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:04:30,885][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:04:31,386][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:04:31,890][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:04:32,396][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:04:32,899][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:04:33,401][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:04:33,903][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:04:34,406][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:04:34,909][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:04:35,407][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:04:35,907][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:04:36,409][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:04:36,910][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:04:37,409][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:04:37,907][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:04:38,407][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:04:38,906][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:04:39,404][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:04:39,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:04:40,401][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:04:40,902][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:04:41,404][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:04:41,904][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:04:42,404][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:04:42,905][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:04:43,405][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:04:43,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:04:44,405][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:04:44,905][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:04:45,406][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:04:45,907][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:04:46,412][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:04:46,919][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:04:47,419][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:04:47,919][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:04:48,420][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:04:48,920][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:04:49,420][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:04:49,920][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:04:50,419][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:04:50,917][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:04:51,416][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:04:51,916][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:04:52,423][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:04:52,923][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:04:53,423][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:04:53,923][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:04:54,422][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:04:54,936][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:04:55,437][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:04:55,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:04:56,443][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:04:56,944][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:04:57,445][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:04:57,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10127 tokens. [2025-11-12 23:04:58,640][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.02%, ΔTime: 00:00:32 [2025-11-12 23:04:59,395][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:04:59,397][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:04:59,399][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:05:00,328][__main__][INFO] - Iteration 62 took 52s (31.04% Gen, 67.17% Train). Generation: 16s, Training: 34s. Estimated remaining time: 42h 27m 0s. Estimated total time: 43h 22m 19s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 44s, 500 more iterations: 7h 13m 43s. [2025-11-12 23:05:00,331][__main__][INFO] - Starting iteration 62. [2025-11-12 23:05:00,832][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-12 23:05:00,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:05:15,951][__main__][INFO] - Number of regex retries in iteration 62: 0 [2025-11-12 23:05:15,952][__main__][INFO] - agents played in iteration 62 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:05:16,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:05:16,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:05:16,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:05:16,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:05:16,886][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:05:16,887][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:05:17,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:05:17,956][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:05:18,466][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:05:18,968][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:05:19,469][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:05:19,968][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:05:20,466][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:05:20,976][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:05:21,483][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:05:21,988][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:05:22,490][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:05:22,990][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:05:23,489][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:05:23,993][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:05:24,494][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:05:24,994][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:05:25,493][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:05:25,992][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:05:26,489][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:05:26,988][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:05:27,489][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:05:27,986][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:05:28,484][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:05:28,983][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:05:29,481][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:05:29,985][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:05:30,483][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:05:30,982][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:05:31,484][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:05:31,981][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:05:32,483][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:05:32,986][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:05:33,486][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:05:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:05:34,486][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:05:34,986][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:05:35,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:05:35,993][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:05:36,495][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:05:36,998][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:05:37,500][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:05:38,006][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:05:38,507][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:05:39,011][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:05:39,512][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:05:40,011][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:05:40,510][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:05:41,009][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:05:41,507][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:05:42,007][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:05:42,507][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:05:43,005][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:05:43,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:05:44,005][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:05:44,506][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:05:45,007][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:05:45,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:05:46,009][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:05:46,510][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:05:47,013][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:05:47,517][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:05:48,022][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:05:48,528][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:05:49,038][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:05:49,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10191 tokens. [2025-11-12 23:05:50,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.09%, ΔTime: 00:00:32 [2025-11-12 23:05:50,992][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:05:50,994][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:05:50,996][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:05:51,906][__main__][INFO] - Iteration 63 took 51s (29.60% Gen, 68.62% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 37m 32s. Estimated total time: 42h 33m 42s. Time estimates for 10 more iterations: 8m 30s, 100 more iterations: 1h 25m 7s, 500 more iterations: 7h 5m 37s. [2025-11-12 23:05:51,908][__main__][INFO] - Starting iteration 63. [2025-11-12 23:05:52,409][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-12 23:05:52,409][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:05:57,088][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:06:01,189][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given the values, you should prioritize the items where your value is significantly higher than Bob's. Hats and books are beneficial for you, as your values are 10, while Bob's values for hats and books are 1. You should not allocate any balls since your value for balls is 1, and Bob values them at 10. By proposing to take all hats and none of the other items, you can maximize your points from the hats. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:06:09,485][__main__][INFO] - Number of regex retries in iteration 63: 2 [2025-11-12 23:06:09,486][__main__][INFO] - agents played in iteration 63 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:06:10,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:06:10,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:06:10,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:06:10,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM 
% (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:06:10,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:06:10,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 23:06:11,096][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:06:11,556][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:06:12,075][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:06:12,580][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:06:13,081][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:06:13,584][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:06:14,082][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:06:14,586][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:06:15,085][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:06:15,591][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:06:16,096][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:06:16,597][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:06:17,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:06:17,598][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:06:18,100][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:06:18,600][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:06:19,103][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:06:19,603][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 
[2025-11-12 23:06:20,105][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:06:20,607][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:06:21,109][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 23:06:21,631][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:06:22,134][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:06:22,637][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:06:23,139][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:06:23,640][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:06:24,145][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:06:24,649][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:06:25,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:06:25,659][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:06:26,159][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:06:26,678][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:06:27,181][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:06:27,687][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:06:28,192][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:06:28,692][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:06:29,201][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:06:29,704][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:06:30,205][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 
[2025-11-12 23:06:30,706][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:06:31,206][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:06:31,705][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 23:06:32,206][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:06:32,705][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:06:33,207][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:06:33,707][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:06:34,205][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:06:34,705][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:06:35,207][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:06:35,710][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:06:36,209][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:06:36,711][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:06:37,214][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:06:37,714][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:06:38,219][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:06:38,721][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:06:39,222][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:06:39,721][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:06:40,221][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:06:40,719][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 
[2025-11-12 23:06:41,218][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:06:41,717][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:06:42,216][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 23:06:42,717][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:06:43,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10364 tokens. [2025-11-12 23:06:43,911][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.04%, ΔTime: 00:00:32 [2025-11-12 23:06:44,675][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:06:44,677][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:06:44,679][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:06:45,636][__main__][INFO] - Iteration 64 took 53s (32.08% Gen, 66.12% Train). Generation: 17s, Training: 35s. Estimated remaining time: 43h 24m 17s. Estimated total time: 44h 21m 21s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 42s, 500 more iterations: 7h 23m 33s. [2025-11-12 23:06:45,638][__main__][INFO] - Starting iteration 64. [2025-11-12 23:06:46,188][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-12 23:06:46,189][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:07:02,321][__main__][INFO] - Number of regex retries in iteration 64: 0 [2025-11-12 23:07:02,322][__main__][INFO] - agents played in iteration 64 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:07:03,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:07:03,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:07:03,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:07:03,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:07:03,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:07:03,357][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:07:04,031][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:07:04,488][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:07:05,000][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:07:05,505][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:07:06,008][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:07:06,508][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:07:07,009][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:07:07,518][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:07:08,024][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:07:08,552][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:07:09,058][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:07:09,560][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:07:10,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:07:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:07:11,077][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:07:11,598][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:07:12,101][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:07:12,601][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:07:13,103][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:07:13,603][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:07:14,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:07:14,606][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:07:15,110][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:07:15,613][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:07:16,119][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:07:16,621][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:07:17,122][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:07:17,622][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:07:18,131][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:07:18,633][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:07:19,135][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:07:19,637][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:07:20,135][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:07:20,637][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:07:21,140][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:07:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:07:22,145][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:07:22,646][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:07:23,148][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:07:23,654][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:07:24,154][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:07:24,656][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:07:25,157][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:07:25,658][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:07:26,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:07:26,661][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:07:27,162][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:07:27,675][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:07:28,175][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:07:28,673][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:07:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:07:29,677][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:07:30,188][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:07:30,683][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:07:31,182][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:07:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:07:32,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:07:32,690][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:07:33,191][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:07:33,692][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:07:34,202][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:07:34,701][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:07:35,200][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:07:35,708][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:07:36,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10287 tokens. [2025-11-12 23:07:36,880][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:32 [2025-11-12 23:07:37,632][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:07:37,634][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:07:37,636][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:07:38,550][__main__][INFO] - Iteration 65 took 52s (30.81% Gen, 67.44% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 40m 10s. Estimated total time: 43h 38m 7s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 16s, 500 more iterations: 7h 16m 21s. [2025-11-12 23:07:38,553][__main__][INFO] - Starting iteration 65. [2025-11-12 23:07:39,043][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-12 23:07:39,044][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:07:54,529][__main__][INFO] - Number of regex retries in iteration 65: 0 [2025-11-12 23:07:54,530][__main__][INFO] - agents played in iteration 65 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:07:55,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:07:55,497][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:07:55,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:07:55,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:07:55,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:07:55,546][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:07:56,214][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:07:56,681][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:07:57,191][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:07:57,692][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:07:58,197][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:07:58,699][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:07:59,209][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:07:59,708][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:08:00,210][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:08:00,710][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:08:01,209][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:08:01,711][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:08:02,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:08:02,711][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:08:03,213][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:08:03,714][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:08:04,216][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:08:04,719][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:08:05,222][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:08:05,727][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:08:06,229][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:08:06,732][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:08:07,239][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:08:07,744][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:08:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:08:08,750][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:08:09,252][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:08:09,758][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:08:10,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:08:10,769][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:08:11,280][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:08:11,780][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:08:12,296][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:08:12,796][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:08:13,296][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:08:13,814][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:08:14,316][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:08:14,819][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:08:15,318][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:08:15,818][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:08:16,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:08:16,822][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:08:17,321][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:08:17,821][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:08:18,320][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:08:18,824][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:08:19,323][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:08:19,823][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:08:20,325][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:08:20,827][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:08:21,329][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:08:21,829][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:08:22,331][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:08:22,836][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:08:23,337][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:08:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:08:24,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:08:24,842][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:08:25,348][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:08:25,848][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:08:26,345][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:08:26,844][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:08:27,342][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:08:27,840][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:08:28,344][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10315 tokens. [2025-11-12 23:08:29,031][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:32 [2025-11-12 23:08:29,785][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:08:29,787][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:08:29,789][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:08:30,709][__main__][INFO] - Iteration 66 took 51s (29.97% Gen, 68.24% Train). Generation: 15s, Training: 35s. Estimated remaining time: 42h 4m 29s. Estimated total time: 43h 3m 18s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 6s, 500 more iterations: 7h 10m 33s. [2025-11-12 23:08:30,711][__main__][INFO] - Starting iteration 66. [2025-11-12 23:08:31,216][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-12 23:08:31,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:08:33,163][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:08:33,268][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:08:46,131][__main__][INFO] - Number of regex retries in iteration 66: 2 [2025-11-12 23:08:46,131][__main__][INFO] - agents played in iteration 66 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:08:46,996][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:08:47,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:08:47,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:08:47,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:08:47,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:08:47,069][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:08:47,753][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:08:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:08:48,728][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:08:49,238][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:08:49,746][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:08:50,249][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:08:50,755][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:08:51,262][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:08:51,765][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:08:52,269][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:08:52,768][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:08:53,276][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:08:53,777][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:08:54,279][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:08:54,782][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:08:55,281][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:08:55,781][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:08:56,282][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:08:56,782][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:08:57,282][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:08:57,785][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:08:58,310][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:08:58,813][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:08:59,319][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:08:59,824][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:09:00,323][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:09:00,829][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:09:01,330][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:09:01,834][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:09:02,344][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:09:02,845][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:09:03,347][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:09:03,848][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:09:04,349][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:09:04,852][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:09:05,358][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:09:05,860][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:09:06,364][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:09:06,864][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:09:07,364][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:09:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:09:08,362][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:09:08,862][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:09:09,363][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:09:09,861][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:09:10,361][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:09:10,860][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:09:11,361][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:09:11,862][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:09:12,363][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:09:12,862][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:09:13,362][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:09:13,860][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:09:14,365][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:09:14,865][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:09:15,362][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:09:15,864][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:09:16,369][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:09:16,873][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:09:17,378][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:09:17,880][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:09:18,383][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:09:18,883][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:09:19,386][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:09:19,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10292 tokens. [2025-11-12 23:09:20,519][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.10%, ΔTime: 00:00:32 [2025-11-12 23:09:21,248][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:09:21,249][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:09:21,251][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:09:22,167][__main__][INFO] - Iteration 67 took 50s (29.27% Gen, 68.93% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 27m 56s. Estimated total time: 42h 27m 36s. Time estimates for 10 more iterations: 8m 29s, 100 more iterations: 1h 24m 55s, 500 more iterations: 7h 4m 36s. [2025-11-12 23:09:22,170][__main__][INFO] - Starting iteration 67. [2025-11-12 23:09:22,654][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-12 23:09:22,655][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:09:37,889][__main__][INFO] - Number of regex retries in iteration 67: 0 [2025-11-12 23:09:37,889][__main__][INFO] - agents played in iteration 67 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:09:38,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:09:38,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:09:38,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:09:38,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:09:38,835][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:09:38,836][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:09:39,487][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:09:39,949][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:09:40,470][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:09:40,979][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:09:41,493][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:09:42,005][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:09:42,512][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:09:43,024][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:09:43,527][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:09:44,028][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:09:44,529][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:09:45,033][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:09:45,540][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:09:46,043][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:09:46,547][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:09:47,047][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:09:47,549][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:09:48,050][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:09:48,549][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:09:49,050][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:09:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:09:50,057][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:09:50,559][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:09:51,065][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:09:51,568][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:09:52,067][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:09:52,569][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:09:53,070][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:09:53,574][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:09:54,078][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:09:54,579][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:09:55,079][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:09:55,578][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:09:56,087][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:09:56,588][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:09:57,090][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:09:57,603][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:09:58,105][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:09:58,617][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:09:59,117][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:09:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:10:00,118][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:10:00,623][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:10:01,123][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:10:01,621][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:10:02,120][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:10:02,632][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:10:03,131][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:10:03,630][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:10:04,129][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:10:04,630][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:10:05,138][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:10:05,639][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:10:06,143][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:10:06,649][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:10:07,151][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:10:07,651][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:10:08,154][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:10:08,657][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:10:09,162][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:10:09,665][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:10:10,167][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:10:10,666][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:10:11,167][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:10:11,667][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10239 tokens. [2025-11-12 23:10:12,293][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.46%, ΔTime: 00:00:32 [2025-11-12 23:10:13,041][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:10:13,043][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:10:13,045][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:10:13,983][__main__][INFO] - Iteration 68 took 51s (29.68% Gen, 68.49% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 45m 57s. Estimated total time: 42h 46m 29s. Time estimates for 10 more iterations: 8m 33s, 100 more iterations: 1h 25m 32s, 500 more iterations: 7h 7m 44s. [2025-11-12 23:10:13,985][__main__][INFO] - Starting iteration 68. [2025-11-12 23:10:14,475][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-12 23:10:14,476][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:10:16,469][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:10:16,472][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:10:29,748][__main__][INFO] - Number of regex retries in iteration 68: 2 [2025-11-12 23:10:29,749][__main__][INFO] - agents played in iteration 68 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:10:30,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:10:30,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:10:30,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:10:30,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:10:30,632][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:10:30,633][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:10:31,287][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:10:31,747][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:10:32,254][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:10:32,762][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:10:33,263][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:10:33,763][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:10:34,264][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:10:34,759][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:10:35,261][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:10:35,763][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:10:36,267][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:10:36,767][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:10:37,266][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:10:37,767][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:10:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:10:38,766][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:10:39,264][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:10:39,771][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:10:40,288][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:10:40,792][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:10:41,290][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:10:41,794][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:10:42,299][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:10:42,803][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:10:43,307][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:10:43,809][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:10:44,316][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:10:44,818][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:10:45,321][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:10:45,825][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:10:46,325][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:10:46,826][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:10:47,330][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:10:47,834][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:10:48,346][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:10:48,848][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:10:49,347][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:10:49,847][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:10:50,348][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:10:50,850][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:10:51,353][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:10:51,852][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:10:52,357][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:10:52,858][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:10:53,360][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:10:53,862][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:10:54,362][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:10:54,865][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:10:55,371][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:10:55,877][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:10:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:10:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:10:57,394][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:10:57,896][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:10:58,398][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:10:58,909][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:10:59,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:10:59,915][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:11:00,421][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:11:00,920][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:11:01,433][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:11:01,938][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:11:02,443][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:11:02,950][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:11:03,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10381 tokens. [2025-11-12 23:11:04,080][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:32 [2025-11-12 23:11:04,802][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:11:04,804][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:11:04,805][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:11:05,740][__main__][INFO] - Iteration 69 took 51s (29.79% Gen, 68.38% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 41m 51s. Estimated total time: 42h 43m 15s. Time estimates for 10 more iterations: 8m 32s, 100 more iterations: 1h 25m 26s, 500 more iterations: 7h 7m 12s. [2025-11-12 23:11:05,742][__main__][INFO] - Starting iteration 69. [2025-11-12 23:11:06,227][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-12 23:11:06,228][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:11:08,497][mllm.models.large_language_model_local][WARNING] - Response Proposal: 5 hats, 5 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:11:08,640][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:11:22,007][__main__][INFO] - Number of regex retries in iteration 69: 2 [2025-11-12 23:11:22,007][__main__][INFO] - agents played in iteration 69 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:11:22,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:11:22,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:11:22,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:11:22,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:11:22,894][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:11:22,895][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:11:23,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:11:24,040][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:11:24,548][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:11:25,051][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:11:25,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:11:26,057][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:11:26,560][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:11:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:11:27,564][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:11:28,062][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:11:28,565][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:11:29,067][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:11:29,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:11:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:11:30,564][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:11:31,068][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:11:31,569][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:11:32,069][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:11:32,567][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:11:33,070][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:11:33,571][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:11:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:11:34,569][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:11:35,069][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:11:35,577][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:11:36,080][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:11:36,589][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:11:37,091][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:11:37,597][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:11:38,103][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:11:38,618][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:11:39,122][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:11:39,638][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:11:40,140][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:11:40,641][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:11:41,144][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:11:41,647][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:11:42,152][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:11:42,652][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:11:43,153][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:11:43,653][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:11:44,153][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:11:44,655][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:11:45,152][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:11:45,653][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:11:46,159][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:11:46,658][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:11:47,157][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:11:47,667][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:11:48,169][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:11:48,671][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:11:49,175][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:11:49,676][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:11:50,176][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:11:50,677][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:11:51,182][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:11:51,688][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:11:52,189][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:11:52,692][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:11:53,195][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:11:53,699][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:11:54,205][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:11:54,707][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:11:55,211][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:11:55,725][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10356 tokens. [2025-11-12 23:11:56,363][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:32 [2025-11-12 23:11:57,146][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:11:57,148][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:11:57,149][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:11:58,076][__main__][INFO] - Iteration 70 took 51s (30.43% Gen, 67.78% Train). Generation: 15s, Training: 35s. Estimated remaining time: 42h 10m 12s. Estimated total time: 43h 12m 28s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 24s, 500 more iterations: 7h 12m 4s. [2025-11-12 23:11:58,078][__main__][INFO] - Starting iteration 70. [2025-11-12 23:11:58,561][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. 
[2025-11-12 23:11:58,562][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:12:13,775][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:12:14,605][__main__][INFO] - Number of regex retries in iteration 70: 1 [2025-11-12 23:12:14,605][__main__][INFO] - agents played in iteration 70 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:12:15,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:12:15,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:12:15,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:12:15,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:12:15,625][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:12:15,626][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:12:16,307][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:12:16,770][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:12:17,280][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:12:17,790][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:12:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:12:18,800][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:12:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:12:19,806][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:12:20,307][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:12:20,843][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:12:21,349][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:12:21,852][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:12:22,354][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:12:22,865][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:12:23,369][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:12:23,872][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:12:24,378][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:12:24,879][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:12:25,381][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:12:25,881][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:12:26,383][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:12:26,890][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:12:27,394][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:12:27,901][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:12:28,406][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:12:28,910][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:12:29,422][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:12:29,926][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:12:30,428][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:12:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:12:31,444][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:12:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:12:32,459][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:12:32,959][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:12:33,462][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:12:33,962][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:12:34,464][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:12:34,965][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:12:35,463][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:12:35,962][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:12:36,464][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:12:36,964][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:12:37,466][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:12:37,964][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:12:38,464][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:12:38,971][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:12:39,469][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:12:39,970][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:12:40,468][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:12:40,970][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:12:41,476][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:12:41,979][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:12:42,481][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:12:42,983][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:12:43,486][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:12:43,991][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:12:44,493][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:12:44,993][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:12:45,495][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:12:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:12:46,492][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:12:46,991][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:12:47,491][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:12:47,995][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:12:48,496][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10350 tokens. [2025-11-12 23:12:49,192][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.40%, ΔTime: 00:00:32 [2025-11-12 23:12:49,959][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:12:49,960][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:12:49,962][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:12:51,832][__main__][INFO] - Iteration 71 took 53s (30.12% Gen, 66.37% Train). Generation: 16s, Training: 35s. Estimated remaining time: 43h 20m 22s. Estimated total time: 44h 23m 33s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 47s, 500 more iterations: 7h 23m 55s. [2025-11-12 23:12:51,834][__main__][INFO] - Starting iteration 71. [2025-11-12 23:12:52,377][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-12 23:12:52,378][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:13:08,758][__main__][INFO] - Number of regex retries in iteration 71: 0 [2025-11-12 23:13:08,758][__main__][INFO] - agents played in iteration 71 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:13:09,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:13:09,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:13:09,766][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:13:09,789][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:13:09,790][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:13:09,791][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:13:10,457][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:13:10,913][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:13:11,417][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:13:11,922][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:13:12,424][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:13:12,930][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:13:13,428][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:13:13,926][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:13:14,434][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:13:14,936][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:13:15,444][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:13:15,946][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:13:16,448][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:13:16,951][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:13:17,480][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:13:17,982][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:13:18,484][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:13:18,984][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:13:19,486][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:13:19,986][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:13:20,485][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:13:20,986][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:13:21,488][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:13:21,992][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:13:22,492][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:13:22,994][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:13:23,498][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:13:24,002][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:13:24,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:13:25,004][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:13:25,504][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:13:26,006][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:13:26,503][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:13:27,002][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:13:27,504][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:13:28,002][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:13:28,501][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:13:29,002][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:13:29,501][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:13:30,001][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:13:30,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:13:31,002][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:13:31,502][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:13:31,999][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:13:32,498][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:13:32,996][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:13:33,491][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:13:33,991][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:13:34,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:13:34,996][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:13:35,495][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:13:35,995][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:13:36,495][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:13:36,996][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:13:37,501][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:13:38,004][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:13:38,503][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:13:39,002][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:13:39,501][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:13:40,009][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:13:40,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:13:41,026][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:13:41,524][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:13:42,025][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:13:42,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10279 tokens. [2025-11-12 23:13:43,228][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.29%, Block Peak % of device VRAM: 62.01%, ΔTime: 00:00:32 [2025-11-12 23:13:43,984][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:13:43,986][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:13:43,988][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:13:44,923][__main__][INFO] - Iteration 72 took 52s (31.17% Gen, 67.05% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 43m 15s. Estimated total time: 43h 47m 18s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 34s, 500 more iterations: 7h 17m 53s. [2025-11-12 23:13:44,925][__main__][INFO] - Starting iteration 72. [2025-11-12 23:13:45,492][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-12 23:13:45,493][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:14:01,000][__main__][INFO] - Number of regex retries in iteration 72: 0 [2025-11-12 23:14:01,002][__main__][INFO] - agents played in iteration 72 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:14:01,853][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:14:01,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:14:01,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:14:01,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:14:01,932][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:14:01,934][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:14:02,596][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:14:03,055][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:14:03,561][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:14:04,077][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:14:04,576][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:14:05,079][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:14:05,579][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:14:06,079][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:14:06,591][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:14:07,094][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:14:07,593][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:14:08,092][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:14:08,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:14:09,097][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:14:09,607][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:14:10,104][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:14:10,604][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:14:11,105][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:14:11,605][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:14:12,109][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:14:12,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:14:13,116][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:14:13,622][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:14:14,123][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:14:14,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:14:15,129][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:14:15,631][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:14:16,132][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:14:16,637][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:14:17,149][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:14:17,653][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:14:18,153][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:14:18,661][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:14:19,160][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:14:19,681][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:14:20,180][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:14:20,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:14:21,178][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:14:21,677][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:14:22,178][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:14:22,682][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:14:23,188][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:14:23,695][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:14:24,197][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:14:24,695][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:14:25,200][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:14:25,703][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:14:26,204][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:14:26,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:14:27,202][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:14:27,705][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:14:28,203][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:14:28,702][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:14:29,206][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:14:29,712][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:14:30,213][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:14:30,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:14:31,221][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:14:31,725][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:14:32,224][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:14:32,728][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:14:33,233][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:14:33,741][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:14:34,256][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:14:34,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10318 tokens. [2025-11-12 23:14:35,503][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.00%, ΔTime: 00:00:32 [2025-11-12 23:14:36,264][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:14:36,266][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:14:36,268][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:14:37,234][__main__][INFO] - Iteration 73 took 51s (29.97% Gen, 68.16% Train). Generation: 15s, Training: 35s. Estimated remaining time: 42h 2m 10s. Estimated total time: 43h 7m 6s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 14s, 500 more iterations: 7h 11m 11s. [2025-11-12 23:14:37,236][__main__][INFO] - Starting iteration 73. [2025-11-12 23:14:37,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-12 23:14:37,720][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:14:41,336][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:14:54,294][__main__][INFO] - Number of regex retries in iteration 73: 1 [2025-11-12 23:14:54,295][__main__][INFO] - agents played in iteration 73 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:14:55,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:14:55,134][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:14:55,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:14:55,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:14:55,183][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:14:55,183][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:14:55,793][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:14:56,249][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:14:56,755][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:14:57,259][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:14:57,758][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:14:58,262][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:14:58,761][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:14:59,276][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:14:59,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:15:00,274][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:15:00,779][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:15:01,280][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:15:01,786][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:15:02,284][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:15:02,782][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:15:03,281][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:15:03,779][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:15:04,278][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:15:04,788][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:15:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:15:05,802][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:15:06,302][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:15:06,805][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:15:07,307][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:15:07,812][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:15:08,316][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:15:08,817][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:15:09,317][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:15:09,822][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:15:10,321][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:15:10,823][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:15:11,327][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:15:11,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:15:12,331][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:15:12,829][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:15:13,329][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:15:13,831][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:15:14,329][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:15:14,835][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:15:15,333][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:15:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:15:16,339][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:15:16,839][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:15:17,338][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:15:17,839][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:15:18,341][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:15:18,844][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:15:19,344][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:15:19,844][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:15:20,343][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:15:20,841][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:15:21,341][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:15:21,840][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:15:22,340][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:15:22,841][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:15:23,341][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:15:23,841][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:15:24,346][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:15:24,848][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:15:25,350][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:15:25,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:15:26,379][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:15:26,887][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:15:27,389][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:15:27,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10333 tokens. [2025-11-12 23:15:28,612][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:32 [2025-11-12 23:15:29,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:15:29,403][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:15:29,405][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:15:30,333][__main__][INFO] - Iteration 74 took 52s (31.50% Gen, 66.73% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 44m 52s. Estimated total time: 43h 50m 40s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 41s, 500 more iterations: 7h 18m 26s. [2025-11-12 23:15:30,335][__main__][INFO] - Starting iteration 74. [2025-11-12 23:15:30,826][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-12 23:15:30,827][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:15:46,758][__main__][INFO] - Number of regex retries in iteration 74: 0 [2025-11-12 23:15:46,758][__main__][INFO] - agents played in iteration 74 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:15:47,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:15:47,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:15:47,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:15:47,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:15:47,746][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:15:47,747][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:15:48,395][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:15:48,869][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:15:49,374][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:15:49,879][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:15:50,386][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:15:50,886][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:15:51,395][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:15:51,903][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:15:52,405][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:15:52,912][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:15:53,414][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:15:53,915][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:15:54,415][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:15:54,915][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:15:55,419][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:15:55,921][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:15:56,424][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:15:56,927][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:15:57,433][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:15:57,936][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:15:58,438][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:15:58,939][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:15:59,441][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:15:59,945][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:16:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:16:00,949][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:16:01,454][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:16:01,956][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:16:02,455][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:16:02,958][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:16:03,474][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:16:03,977][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:16:04,476][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:16:04,981][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:16:05,478][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:16:05,999][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:16:06,498][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:16:06,995][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:16:07,492][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:16:07,989][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:16:08,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:16:08,999][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:16:09,499][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:16:10,003][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:16:10,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:16:11,000][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:16:11,500][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:16:12,000][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:16:12,504][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:16:13,001][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:16:13,500][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:16:14,009][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:16:14,507][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:16:15,007][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:16:15,507][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:16:16,007][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:16:16,515][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:16:17,014][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:16:17,516][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:16:18,019][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:16:18,520][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:16:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:16:19,526][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:16:20,024][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:16:20,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10471 tokens. [2025-11-12 23:16:21,219][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:32 [2025-11-12 23:16:22,004][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:16:22,006][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:16:22,008][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:16:22,913][__main__][INFO] - Iteration 75 took 52s (30.59% Gen, 67.67% Train). Generation: 15s, Training: 35s. Estimated remaining time: 42h 17m 40s. Estimated total time: 43h 24m 21s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 48s, 500 more iterations: 7h 14m 3s. [2025-11-12 23:16:22,915][__main__][INFO] - Starting iteration 75. [2025-11-12 23:16:23,414][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-12 23:16:23,414][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:16:39,690][__main__][INFO] - Number of regex retries in iteration 75: 0 [2025-11-12 23:16:39,691][__main__][INFO] - agents played in iteration 75 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:16:40,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:16:40,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:16:40,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:16:40,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:16:40,695][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:16:40,696][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:16:41,350][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:16:41,807][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:16:42,319][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:16:42,823][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:16:43,324][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:16:43,831][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:16:44,334][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:16:44,837][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:16:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:16:45,841][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:16:46,343][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:16:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:16:47,342][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:16:47,843][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:16:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:16:48,839][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:16:49,346][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:16:49,846][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:16:50,347][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:16:50,850][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:16:51,349][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:16:51,853][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:16:52,362][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:16:52,863][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:16:53,365][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:16:53,865][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:16:54,378][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:16:54,880][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:16:55,382][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:16:55,889][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:16:56,392][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:16:56,905][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:16:57,404][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:16:57,909][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:16:58,420][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:16:58,922][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:16:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:16:59,923][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:17:00,421][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:17:00,926][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:17:01,429][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:17:01,928][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:17:02,428][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:17:02,926][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:17:03,426][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:17:03,924][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:17:04,423][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:17:04,925][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:17:05,426][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:17:05,925][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:17:06,427][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:17:06,928][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:17:07,432][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:17:07,936][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:17:08,435][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:17:08,940][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:17:09,444][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:17:09,943][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:17:10,441][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:17:10,942][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:17:11,449][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:17:11,949][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:17:12,451][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:17:12,949][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:17:13,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10354 tokens. [2025-11-12 23:17:14,149][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:32 [2025-11-12 23:17:14,905][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:17:14,906][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:17:14,909][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:17:15,812][__main__][INFO] - Iteration 76 took 52s (31.06% Gen, 67.21% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 32m 22s. Estimated total time: 43h 39m 56s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 19s, 500 more iterations: 7h 16m 39s. [2025-11-12 23:17:15,814][__main__][INFO] - Starting iteration 76. [2025-11-12 23:17:16,332][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-12 23:17:16,332][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:17:32,420][__main__][INFO] - Number of regex retries in iteration 76: 0 [2025-11-12 23:17:32,421][__main__][INFO] - agents played in iteration 76 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:17:33,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:17:33,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:17:33,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:17:33,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:17:33,316][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:17:33,317][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:17:33,973][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:17:34,433][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:17:34,935][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:17:35,436][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:17:35,965][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:17:36,465][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:17:36,964][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:17:37,467][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:17:37,969][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:17:38,475][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:17:38,994][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:17:39,500][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:17:40,002][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:17:40,513][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:17:41,016][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:17:41,523][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:17:42,024][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:17:42,526][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:17:43,036][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:17:43,543][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:17:44,047][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:17:44,554][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:17:45,060][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:17:45,565][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:17:46,069][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:17:46,572][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:17:47,081][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:17:47,583][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:17:48,089][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:17:48,589][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:17:49,093][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:17:49,594][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:17:50,094][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:17:50,594][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:17:51,095][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:17:51,598][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:17:52,119][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:17:52,622][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:17:53,125][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:17:53,626][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:17:54,127][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:17:54,631][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:17:55,129][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:17:55,630][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:17:56,133][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:17:56,632][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:17:57,136][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:17:57,634][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:17:58,136][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:17:58,641][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:17:59,144][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:17:59,648][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:18:00,150][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:18:00,651][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:18:01,157][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:18:01,659][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:18:02,162][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:18:02,664][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:18:03,165][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:18:03,667][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:18:04,169][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:18:04,673][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:18:05,176][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:18:05,678][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:18:06,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10564 tokens. [2025-11-12 23:18:06,877][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.35%, ΔTime: 00:00:32 [2025-11-12 23:18:07,657][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:18:07,658][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:18:07,660][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:18:08,615][__main__][INFO] - Iteration 77 took 52s (30.77% Gen, 67.40% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 25m 45s. Estimated total time: 43h 34m 11s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 8s, 500 more iterations: 7h 15m 41s. [2025-11-12 23:18:08,618][__main__][INFO] - Starting iteration 77. [2025-11-12 23:18:09,177][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-12 23:18:09,178][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:18:20,430][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:18:25,196][__main__][INFO] - Number of regex retries in iteration 77: 1 [2025-11-12 23:18:25,197][__main__][INFO] - agents played in iteration 77 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:18:26,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:18:26,061][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:18:26,090][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:18:26,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:18:26,117][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:18:26,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:18:26,776][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:18:27,241][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:18:27,749][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:18:28,249][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:18:28,753][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:18:29,253][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:18:29,754][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:18:30,259][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:18:30,765][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:18:31,272][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:18:31,776][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:18:32,279][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:18:32,780][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:18:33,284][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:18:33,810][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:18:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:18:34,824][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:18:35,332][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:18:35,841][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:18:36,345][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:18:36,852][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:18:37,353][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:18:37,857][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:18:38,363][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:18:38,866][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:18:39,373][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:18:39,875][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:18:40,379][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:18:40,894][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:18:41,398][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:18:41,918][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:18:42,419][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:18:42,923][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:18:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:18:43,930][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:18:44,432][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:18:44,932][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:18:45,437][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:18:45,942][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:18:46,446][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:18:46,946][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:18:47,450][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:18:47,950][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:18:48,451][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:18:48,950][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:18:49,451][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:18:49,955][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:18:50,457][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:18:50,955][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:18:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:18:51,957][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:18:52,459][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:18:52,962][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:18:53,461][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:18:53,965][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:18:54,463][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:18:54,965][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:18:55,467][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:18:55,968][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:18:56,469][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:18:56,969][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:18:57,467][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:18:57,967][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:18:58,466][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:18:58,966][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10474 tokens. [2025-11-12 23:18:59,652][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:32 [2025-11-12 23:19:00,435][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:19:00,437][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:19:00,438][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:19:01,370][__main__][INFO] - Iteration 78 took 52s (30.67% Gen, 67.47% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 22m 21s. Estimated total time: 43h 31m 41s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 16s. [2025-11-12 23:19:01,372][__main__][INFO] - Starting iteration 78. [2025-11-12 23:19:01,899][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-12 23:19:01,899][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:19:05,362][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:19:17,925][__main__][INFO] - Number of regex retries in iteration 78: 1 [2025-11-12 23:19:17,925][__main__][INFO] - agents played in iteration 78 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:19:18,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:19:18,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:19:18,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:19:18,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:19:18,961][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:19:18,962][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:19:19,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:19:20,115][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:19:20,624][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:19:21,135][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:19:21,641][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:19:22,143][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:19:22,649][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:19:23,147][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:19:23,646][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:19:24,149][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:19:24,650][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:19:25,153][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:19:25,654][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:19:26,157][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:19:26,657][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:19:27,162][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:19:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:19:28,170][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:19:28,675][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:19:29,181][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:19:29,683][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:19:30,183][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:19:30,682][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:19:31,184][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:19:31,687][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:19:32,189][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:19:32,693][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:19:33,207][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:19:33,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:19:34,220][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:19:34,721][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:19:35,223][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:19:35,728][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:19:36,227][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:19:36,727][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:19:37,226][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:19:37,725][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:19:38,240][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:19:38,740][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:19:39,243][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:19:39,741][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:19:40,241][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:19:40,745][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:19:41,247][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:19:41,749][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:19:42,252][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:19:42,752][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:19:43,255][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:19:43,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:19:44,253][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:19:44,754][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:19:45,255][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:19:45,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:19:46,263][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:19:46,761][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:19:47,267][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:19:47,768][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:19:48,271][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:19:48,774][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:19:49,276][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:19:49,778][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:19:50,278][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:19:50,780][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:19:51,279][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:19:51,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10425 tokens. [2025-11-12 23:19:52,461][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.10%, ΔTime: 00:00:32 [2025-11-12 23:19:53,225][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:19:53,227][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:19:53,229][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:19:54,170][__main__][INFO] - Iteration 79 took 52s (30.66% Gen, 67.54% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 23m 22s. Estimated total time: 43h 33m 34s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 35s. [2025-11-12 23:19:54,172][__main__][INFO] - Starting iteration 79. [2025-11-12 23:19:54,721][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-12 23:19:54,721][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:20:10,304][__main__][INFO] - Number of regex retries in iteration 79: 0 [2025-11-12 23:20:10,305][__main__][INFO] - agents played in iteration 79 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:20:11,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:20:11,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:20:11,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:20:11,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:20:11,190][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:20:11,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:20:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:20:12,334][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:20:12,845][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:20:13,350][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:20:13,856][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:20:14,358][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:20:14,861][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:20:15,368][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:20:15,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:20:16,375][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:20:16,875][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:20:17,380][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:20:17,883][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:20:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:20:18,887][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:20:19,393][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:20:19,895][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:20:20,397][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:20:20,904][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:20:21,409][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:20:21,918][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:20:22,426][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:20:22,927][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:20:23,441][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:20:23,946][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:20:24,459][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:20:24,963][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:20:25,466][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:20:25,966][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:20:26,468][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:20:26,974][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:20:27,477][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:20:27,981][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:20:28,483][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:20:28,984][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:20:29,490][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:20:29,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:20:30,493][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:20:30,994][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:20:31,497][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:20:31,998][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:20:32,502][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:20:33,004][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:20:33,505][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:20:34,005][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:20:34,507][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:20:35,011][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:20:35,514][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:20:36,017][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:20:36,519][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:20:37,020][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:20:37,540][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:20:38,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:20:38,543][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:20:39,047][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:20:39,550][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:20:40,060][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:20:40,562][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:20:41,064][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:20:41,568][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:20:42,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:20:42,567][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:20:43,066][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:20:43,565][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:20:44,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10507 tokens. [2025-11-12 23:20:44,717][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:32 [2025-11-12 23:20:45,508][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:20:45,509][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:20:45,511][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:20:46,437][__main__][INFO] - Iteration 80 took 51s (30.13% Gen, 68.07% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 54m 46s. Estimated total time: 43h 5m 51s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 11s, 500 more iterations: 7h 10m 58s. [2025-11-12 23:20:46,440][__main__][INFO] - Starting iteration 80. [2025-11-12 23:20:46,933][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. 
[2025-11-12 23:20:46,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:20:55,657][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:21:03,037][__main__][INFO] - Number of regex retries in iteration 80: 1 [2025-11-12 23:21:03,038][__main__][INFO] - agents played in iteration 80 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:21:03,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:21:03,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:21:03,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:21:03,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:21:03,904][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:21:03,905][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:21:04,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:21:05,006][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:21:05,518][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:21:06,019][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:21:06,528][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:21:07,028][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:21:07,527][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:21:08,038][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:21:08,539][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:21:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:21:09,551][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:21:10,059][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:21:10,562][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:21:11,062][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:21:11,562][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:21:12,067][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:21:12,567][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:21:13,079][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:21:13,583][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:21:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:21:14,591][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:21:15,096][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:21:15,600][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:21:16,105][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:21:16,609][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:21:17,114][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:21:17,614][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:21:18,117][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:21:18,626][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:21:19,128][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:21:19,640][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:21:20,163][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:21:20,664][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:21:21,165][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:21:21,663][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:21:22,169][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:21:22,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:21:23,179][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:21:23,684][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:21:24,185][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:21:24,684][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:21:25,183][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:21:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:21:26,185][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:21:26,686][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:21:27,185][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:21:27,688][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:21:28,189][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:21:28,689][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:21:29,192][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:21:29,696][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:21:30,200][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:21:30,700][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:21:31,202][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:21:31,706][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:21:32,210][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:21:32,714][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:21:33,220][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:21:33,721][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:21:34,223][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:21:34,722][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:21:35,224][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:21:35,724][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:21:36,223][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:21:36,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10459 tokens. [2025-11-12 23:21:37,355][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:32 [2025-11-12 23:21:38,158][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:21:38,159][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:21:38,161][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:21:40,026][__main__][INFO] - Iteration 81 took 53s (30.33% Gen, 66.15% Train). Generation: 16s, Training: 35s. Estimated remaining time: 43h 2m 41s. Estimated total time: 44h 14m 40s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 29s, 500 more iterations: 7h 22m 26s. [2025-11-12 23:21:40,028][__main__][INFO] - Starting iteration 81. [2025-11-12 23:21:40,533][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. [2025-11-12 23:21:40,534][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:21:50,885][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given the per-item values: - Hats: I value at 10, Alice values at 1. - Books: I value at 10, Alice values at 10. 
- Balls: I value at 1, Alice values at 10. I will take all the hats since they are worth more to me relative to their value to Alice. I will not propose to take any books as they are valued more by Alice. Similarly, I won't take any balls as they are also more valuable to Alice. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:21:58,658][__main__][INFO] - Number of regex retries in iteration 81: 1 [2025-11-12 23:21:58,658][__main__][INFO] - agents played in iteration 81 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:21:59,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:21:59,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:21:59,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:21:59,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:21:59,705][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:21:59,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:22:00,402][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:22:00,860][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:22:01,369][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:22:01,878][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:22:02,382][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:22:02,884][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:22:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:22:03,892][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:22:04,396][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:22:04,902][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:22:05,434][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:22:05,944][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:22:06,450][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:22:06,970][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:22:07,479][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:22:07,986][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:22:08,492][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:22:08,997][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:22:09,504][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:22:10,005][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:22:10,508][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:22:11,010][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:22:11,512][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:22:12,017][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:22:12,524][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:22:13,028][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:22:13,528][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:22:14,031][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:22:14,535][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:22:15,049][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:22:15,551][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:22:16,070][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:22:16,571][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:22:17,078][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:22:17,581][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:22:18,084][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:22:18,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:22:19,087][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:22:19,590][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:22:20,096][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:22:20,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:22:21,094][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:22:21,592][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:22:22,091][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:22:22,592][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:22:23,092][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:22:23,591][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:22:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:22:24,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:22:25,099][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:22:25,603][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:22:26,104][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:22:26,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:22:27,106][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:22:27,605][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:22:28,108][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:22:28,608][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:22:29,105][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:22:29,602][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:22:30,101][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:22:30,604][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:22:31,101][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:22:31,598][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:22:32,095][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:22:32,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10441 tokens. [2025-11-12 23:22:33,290][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:32 [2025-11-12 23:22:34,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:22:34,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:22:34,069][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:22:35,017][__main__][INFO] - Iteration 82 took 54s (33.27% Gen, 64.99% Train). Generation: 18s, Training: 35s. Estimated remaining time: 44h 11m 19s. Estimated total time: 45h 24m 12s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 48s, 500 more iterations: 7h 34m 2s. [2025-11-12 23:22:35,019][__main__][INFO] - Starting iteration 82. [2025-11-12 23:22:35,533][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-12 23:22:35,534][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:22:51,205][__main__][INFO] - Number of regex retries in iteration 82: 0 [2025-11-12 23:22:51,206][__main__][INFO] - agents played in iteration 82 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:22:52,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:22:52,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:22:52,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:22:52,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:22:52,252][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:22:52,253][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:22:52,926][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:22:53,386][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:22:53,893][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:22:54,393][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:22:54,897][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:22:55,399][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:22:55,903][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:22:56,407][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:22:56,911][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:22:57,413][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:22:57,916][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:22:58,425][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:22:58,933][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:22:59,442][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:22:59,946][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:23:00,447][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:23:00,962][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:23:01,470][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:23:01,984][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:23:02,487][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:23:02,991][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:23:03,503][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:23:04,005][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:23:04,511][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:23:05,014][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:23:05,517][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:23:06,023][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:23:06,524][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:23:07,033][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:23:07,537][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:23:08,036][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:23:08,538][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:23:09,039][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:23:09,540][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:23:10,043][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:23:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:23:11,047][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:23:11,544][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:23:12,045][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:23:12,547][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:23:13,046][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:23:13,547][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:23:14,057][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:23:14,558][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:23:15,065][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:23:15,566][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:23:16,066][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:23:16,569][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:23:17,071][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:23:17,573][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:23:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:23:18,583][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:23:19,100][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:23:19,601][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:23:20,100][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:23:20,601][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:23:21,101][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:23:21,603][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:23:22,102][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:23:22,601][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:23:23,106][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:23:23,603][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:23:24,103][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:23:24,603][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:23:25,102][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10368 tokens. [2025-11-12 23:23:25,778][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:32 [2025-11-12 23:23:26,609][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:23:26,610][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:23:26,612][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:23:27,546][__main__][INFO] - Iteration 83 took 52s (30.13% Gen, 68.07% Train). Generation: 15s, Training: 35s. Estimated remaining time: 42h 6m 55s. Estimated total time: 43h 20m 41s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 41s, 500 more iterations: 7h 13m 26s. [2025-11-12 23:23:27,549][__main__][INFO] - Starting iteration 83. [2025-11-12 23:23:28,093][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-12 23:23:28,094][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:23:43,941][__main__][INFO] - Number of regex retries in iteration 83: 0 [2025-11-12 23:23:43,941][__main__][INFO] - agents played in iteration 83 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:23:44,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:23:44,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:23:44,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:23:44,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:23:44,947][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:23:44,948][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:23:45,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:23:46,077][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:23:46,583][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:23:47,088][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:23:47,593][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:23:48,095][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:23:48,599][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:23:49,100][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:23:49,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:23:50,109][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:23:50,615][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:23:51,115][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:23:51,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:23:52,124][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:23:52,626][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:23:53,124][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:23:53,627][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:23:54,132][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:23:54,642][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:23:55,143][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:23:55,649][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:23:56,150][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:23:56,651][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:23:57,151][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:23:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:23:58,153][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:23:58,652][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:23:59,153][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:23:59,654][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:24:00,156][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:24:00,658][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:24:01,160][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:24:01,666][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:24:02,163][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:24:02,661][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:24:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:24:03,656][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:24:04,157][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:24:04,656][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:24:05,155][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:24:05,652][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:24:06,159][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:24:06,665][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:24:07,166][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:24:07,668][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:24:08,182][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:24:08,683][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:24:09,197][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:24:09,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:24:10,223][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:24:10,725][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:24:11,226][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:24:11,729][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:24:12,230][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:24:12,733][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:24:13,235][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:24:13,738][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:24:14,239][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:24:14,743][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:24:15,245][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:24:15,747][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:24:16,249][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:24:16,750][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:24:17,253][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:24:17,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10483 tokens. [2025-11-12 23:24:18,446][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:32 [2025-11-12 23:24:19,215][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:24:19,216][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:24:19,218][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:24:20,152][__main__][INFO] - Iteration 84 took 52s (30.44% Gen, 67.76% Train). Generation: 15s, Training: 35s. Estimated remaining time: 42h 8m 20s. Estimated total time: 43h 22m 58s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 45s, 500 more iterations: 7h 13m 49s. [2025-11-12 23:24:20,154][__main__][INFO] - Starting iteration 84. [2025-11-12 23:24:20,711][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-12 23:24:20,711][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:24:23,447][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:24:23,450][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:24:37,082][__main__][INFO] - Number of regex retries in iteration 84: 2 [2025-11-12 23:24:37,083][__main__][INFO] - agents played in iteration 84 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:24:37,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:24:37,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:24:37,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:24:37,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:24:37,995][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:24:37,996][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:24:38,631][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:24:39,088][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:24:39,605][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:24:40,114][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:24:40,624][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:24:41,129][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:24:41,633][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:24:42,142][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:24:42,646][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:24:43,150][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:24:43,655][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:24:44,159][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:24:44,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:24:45,172][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:24:45,677][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:24:46,180][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:24:46,683][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:24:47,198][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:24:47,699][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:24:48,197][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:24:48,701][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:24:49,202][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:24:49,714][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:24:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:24:50,714][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:24:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:24:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:24:52,212][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:24:52,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:24:53,206][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:24:53,724][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:24:54,227][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:24:54,729][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:24:55,233][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:24:55,732][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:24:56,233][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:24:56,733][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:24:57,236][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:24:57,741][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:24:58,241][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:24:58,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:24:59,245][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:24:59,746][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:25:00,252][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:25:00,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:25:01,254][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:25:01,751][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:25:02,250][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:25:02,748][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:25:03,250][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:25:03,750][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:25:04,254][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:25:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:25:05,260][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:25:05,761][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:25:06,262][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:25:06,768][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:25:07,267][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:25:07,767][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:25:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:25:08,766][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:25:09,267][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:25:09,766][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:25:10,267][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:25:10,768][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10526 tokens. [2025-11-12 23:25:11,400][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:32 [2025-11-12 23:25:12,177][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:25:12,178][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:25:12,180][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:25:13,100][__main__][INFO] - Iteration 85 took 52s (31.25% Gen, 66.99% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 23m 58s. Estimated total time: 43h 39m 29s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 34s. [2025-11-12 23:25:13,102][__main__][INFO] - Starting iteration 85. [2025-11-12 23:25:13,574][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-12 23:25:13,575][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:25:28,293][__main__][INFO] - Number of regex retries in iteration 85: 0 [2025-11-12 23:25:28,294][__main__][INFO] - agents played in iteration 85 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:25:29,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:25:29,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:25:29,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:25:29,227][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:25:29,227][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:25:29,229][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:25:29,879][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:25:30,342][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:25:30,852][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:25:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:25:31,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:25:32,370][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:25:32,876][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:25:33,384][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:25:33,896][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:25:34,410][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:25:34,916][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:25:35,421][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:25:35,929][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:25:36,432][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:25:36,945][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:25:37,451][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:25:37,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:25:38,464][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:25:38,967][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:25:39,470][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:25:39,973][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:25:40,476][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:25:40,976][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:25:41,478][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:25:41,979][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:25:42,483][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:25:42,986][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:25:43,487][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:25:44,000][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:25:44,500][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:25:45,004][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:25:45,506][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:25:46,007][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:25:46,523][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:25:47,024][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:25:47,524][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:25:48,028][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:25:48,527][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:25:49,044][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:25:49,545][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:25:50,047][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:25:50,547][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:25:51,047][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:25:51,551][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:25:52,055][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:25:52,554][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:25:53,052][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:25:53,551][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:25:54,050][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:25:54,557][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:25:55,061][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:25:55,564][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:25:56,068][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:25:56,570][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:25:57,075][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:25:57,578][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:25:58,082][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:25:58,582][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:25:59,085][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:25:59,587][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:26:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:26:00,587][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:26:01,087][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:26:01,586][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:26:02,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10536 tokens. [2025-11-12 23:26:02,716][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:32 [2025-11-12 23:26:03,480][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:26:03,482][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:26:03,483][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:26:04,421][__main__][INFO] - Iteration 86 took 50s (28.95% Gen, 69.21% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 6m 0s. Estimated total time: 42h 22m 23s. Time estimates for 10 more iterations: 8m 28s, 100 more iterations: 1h 24m 44s, 500 more iterations: 7h 3m 43s. [2025-11-12 23:26:04,424][__main__][INFO] - Starting iteration 86. [2025-11-12 23:26:04,900][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-12 23:26:04,901][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:26:20,445][__main__][INFO] - Number of regex retries in iteration 86: 0 [2025-11-12 23:26:20,446][__main__][INFO] - agents played in iteration 86 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:26:21,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:26:21,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:26:21,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:26:21,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:26:21,319][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:26:21,320][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:26:21,948][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:26:22,405][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:26:22,908][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:26:23,425][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:26:23,925][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:26:24,437][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:26:24,947][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:26:25,449][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:26:25,961][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:26:26,472][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:26:26,976][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:26:27,483][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:26:27,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:26:28,490][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:26:28,995][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:26:29,497][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:26:30,008][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:26:30,514][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:26:31,019][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:26:31,522][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:26:32,023][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:26:32,526][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:26:33,026][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:26:33,527][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:26:34,028][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:26:34,527][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:26:35,076][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:26:35,579][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:26:36,084][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:26:36,589][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:26:37,093][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:26:37,597][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:26:38,101][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:26:38,603][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:26:39,104][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:26:39,605][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:26:40,105][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:26:40,618][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:26:41,122][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:26:41,640][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:26:42,143][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:26:42,645][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:26:43,147][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:26:43,648][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:26:44,154][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:26:44,655][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:26:45,156][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:26:45,662][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:26:46,163][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:26:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:26:47,161][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:26:47,660][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:26:48,166][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:26:48,664][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:26:49,164][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:26:49,662][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:26:50,159][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:26:50,662][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:26:51,163][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:26:51,663][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:26:52,165][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:26:52,664][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:26:53,165][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:26:53,665][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:26:54,165][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10551 tokens. [2025-11-12 23:26:54,809][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:32 [2025-11-12 23:26:55,567][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:26:55,568][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:26:55,570][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:26:56,503][__main__][INFO] - Iteration 87 took 51s (30.12% Gen, 68.07% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 42m 53s. Estimated total time: 43h 0m 8s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 0s, 500 more iterations: 7h 10m 1s. [2025-11-12 23:26:56,505][__main__][INFO] - Starting iteration 87. [2025-11-12 23:26:56,991][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-12 23:26:56,992][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:27:13,802][__main__][INFO] - Number of regex retries in iteration 87: 0 [2025-11-12 23:27:13,803][__main__][INFO] - agents played in iteration 87 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:27:14,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:27:14,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:27:14,660][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:27:14,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:27:14,683][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:27:14,685][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:27:15,362][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:27:15,828][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:27:16,341][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:27:16,854][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:27:17,360][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:27:17,864][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:27:18,368][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:27:18,875][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:27:19,382][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:27:19,889][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:27:20,387][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:27:20,893][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:27:21,398][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:27:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:27:22,411][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:27:22,914][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:27:23,413][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:27:23,910][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:27:24,408][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:27:24,909][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:27:25,409][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:27:25,909][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:27:26,416][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:27:26,916][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:27:27,430][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:27:27,934][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:27:28,434][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:27:28,943][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:27:29,445][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:27:29,959][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:27:30,463][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:27:30,968][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:27:31,473][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:27:31,982][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:27:32,484][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:27:32,985][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:27:33,487][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:27:33,991][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:27:34,494][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:27:35,018][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:27:35,522][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:27:36,020][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:27:36,529][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:27:37,026][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:27:37,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:27:38,023][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:27:38,521][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:27:39,020][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:27:39,521][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:27:40,020][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:27:40,523][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:27:41,023][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:27:41,522][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:27:42,022][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:27:42,521][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:27:43,038][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:27:43,535][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:27:44,035][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:27:44,541][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:27:45,044][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:27:45,554][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:27:46,054][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:27:46,556][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:27:47,059][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:27:47,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10469 tokens. [2025-11-12 23:27:48,195][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.45%, ΔTime: 00:00:32 [2025-11-12 23:27:48,955][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:27:48,957][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:27:48,958][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:27:49,910][__main__][INFO] - Iteration 88 took 52s (31.77% Gen, 66.43% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 47m 51s. Estimated total time: 44h 5m 59s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 11s, 500 more iterations: 7h 20m 59s. [2025-11-12 23:27:49,912][__main__][INFO] - Starting iteration 88. [2025-11-12 23:27:50,413][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-12 23:27:50,414][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:27:53,172][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:28:06,979][__main__][INFO] - Number of regex retries in iteration 88: 1 [2025-11-12 23:28:06,980][__main__][INFO] - agents played in iteration 88 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:28:07,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:28:07,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:28:07,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:28:07,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:28:07,883][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:28:07,883][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:28:08,550][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:28:09,014][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:28:09,523][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:28:10,028][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:28:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:28:11,046][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:28:11,564][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:28:12,069][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:28:12,573][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:28:13,078][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:28:13,581][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:28:14,089][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:28:14,602][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:28:15,118][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:28:15,634][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:28:16,148][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:28:16,653][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:28:17,161][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:28:17,662][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:28:18,172][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:28:18,674][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:28:19,177][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:28:19,680][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:28:20,183][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:28:20,688][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:28:21,185][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:28:21,687][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:28:22,191][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:28:22,694][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:28:23,199][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:28:23,701][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:28:24,203][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:28:24,700][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:28:25,200][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:28:25,701][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:28:26,204][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:28:26,703][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:28:27,205][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:28:27,705][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:28:28,206][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:28:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:28:29,208][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:28:29,706][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:28:30,203][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:28:30,703][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:28:31,204][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:28:31,702][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:28:32,201][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:28:32,701][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:28:33,202][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:28:33,702][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:28:34,202][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:28:34,702][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:28:35,201][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:28:35,701][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:28:36,201][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:28:36,704][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:28:37,204][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:28:37,705][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:28:38,205][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:28:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:28:39,206][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:28:39,705][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:28:40,205][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:28:40,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10699 tokens. [2025-11-12 23:28:41,341][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:32 [2025-11-12 23:28:42,115][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:28:42,116][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:28:42,118][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:28:43,052][__main__][INFO] - Iteration 89 took 52s (31.47% Gen, 66.75% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 32m 56s. Estimated total time: 43h 51m 58s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 43s, 500 more iterations: 7h 18m 39s. [2025-11-12 23:28:43,055][__main__][INFO] - Starting iteration 89. [2025-11-12 23:28:43,521][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-12 23:28:43,522][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:28:45,659][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:28:59,161][__main__][INFO] - Number of regex retries in iteration 89: 1 [2025-11-12 23:28:59,161][__main__][INFO] - agents played in iteration 89 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:28:59,974][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:28:59,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:29:00,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:29:00,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:29:00,044][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:29:00,044][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:29:00,722][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:29:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:29:01,692][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:29:02,191][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:29:02,704][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:29:03,211][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:29:03,715][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:29:04,219][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:29:04,722][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:29:05,227][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:29:05,733][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:29:06,240][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:29:06,745][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:29:07,248][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:29:07,752][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:29:08,269][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:29:08,771][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:29:09,270][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:29:09,774][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:29:10,276][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:29:10,777][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:29:11,279][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:29:11,778][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:29:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:29:12,790][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:29:13,296][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:29:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:29:14,305][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:29:14,808][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:29:15,309][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:29:15,810][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:29:16,311][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:29:16,820][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:29:17,324][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:29:17,825][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:29:18,331][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:29:18,836][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:29:19,339][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:29:19,847][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:29:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:29:20,849][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:29:21,360][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:29:21,861][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:29:22,362][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:29:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:29:23,369][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:29:23,869][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:29:24,369][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:29:24,866][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:29:25,370][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:29:25,872][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:29:26,373][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:29:26,875][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:29:27,374][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:29:27,871][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:29:28,372][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:29:28,871][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:29:29,383][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:29:29,884][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:29:30,385][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:29:30,886][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:29:31,385][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:29:31,890][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:29:32,390][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:29:32,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10683 tokens. [2025-11-12 23:29:33,559][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:32 [2025-11-12 23:29:34,324][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:29:34,325][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:29:34,327][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:29:35,311][__main__][INFO] - Iteration 90 took 51s (30.20% Gen, 67.90% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 49m 38s. Estimated total time: 43h 9m 32s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 19s, 500 more iterations: 7h 11m 35s. [2025-11-12 23:29:35,314][__main__][INFO] - Starting iteration 90. [2025-11-12 23:29:36,143][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 8 and human policies 1. 
[2025-11-12 23:29:36,144][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:29:39,033][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:29:49,084][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 1, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:29:49,302][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given the per-item values: - I value hats at 10. - I value books at 10, but Bob values books at 10, suggesting he might hoard books to maximize his score. - I value balls at 1, while Bob values balls at 10, indicating he would probably take as many balls as possible. To maximize my score, I should focus on the items where my value is significantly higher than Bob's, which is the case for hats. By proposing to take all 10 hats, I can ensure I get the maximum value from this item. I will not propose taking any books or balls since doing so would most likely result in a lower overall score compared to the hats. 
did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:29:53,664][__main__][INFO] - Number of regex retries in iteration 90: 3 [2025-11-12 23:29:53,665][__main__][INFO] - agents played in iteration 90 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:29:54,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:29:54,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:29:54,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:29:54,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:29:54,553][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:29:54,554][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:29:55,214][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:29:55,783][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:29:56,294][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:29:56,798][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:29:57,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:29:57,807][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:29:58,313][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:29:58,818][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:29:59,322][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:29:59,826][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:30:00,330][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:30:00,834][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:30:01,334][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:30:01,836][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:30:02,337][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:30:02,840][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:30:03,342][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:30:03,843][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:30:04,344][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:30:04,847][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:30:05,350][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:30:05,851][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:30:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:30:06,853][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:30:07,350][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:30:07,851][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:30:08,350][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:30:08,851][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:30:09,354][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:30:09,854][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:30:10,364][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:30:10,861][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:30:11,359][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:30:11,859][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:30:12,357][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:30:12,859][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:30:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:30:13,858][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:30:14,357][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:30:14,858][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:30:15,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:30:15,862][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:30:16,363][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:30:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:30:17,364][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:30:17,864][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:30:18,364][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:30:18,865][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:30:19,379][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:30:19,879][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:30:20,378][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:30:20,881][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:30:21,381][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:30:21,890][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:30:22,391][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:30:22,890][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:30:23,390][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:30:23,890][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:30:24,389][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:30:24,890][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:30:25,390][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:30:25,903][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:30:26,404][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:30:26,906][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:30:27,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10697 tokens. [2025-11-12 23:30:28,080][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:32 [2025-11-12 23:30:28,855][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:30:28,856][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:30:28,858][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:30:30,628][__main__][INFO] - Iteration 91 took 54s (32.16% Gen, 64.59% Train). Generation: 17s, Training: 35s. Estimated remaining time: 44h 3m 27s. Estimated total time: 45h 24m 16s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 48s, 500 more iterations: 7h 34m 2s. [2025-11-12 23:30:30,630][__main__][INFO] - Starting iteration 91. [2025-11-12 23:30:31,167][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-12 23:30:31,167][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:30:34,409][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:30:47,558][__main__][INFO] - Number of regex retries in iteration 91: 1 [2025-11-12 23:30:47,559][__main__][INFO] - agents played in iteration 91 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:30:48,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:30:48,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:30:48,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:30:48,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:30:48,465][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:30:48,466][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:30:49,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:30:49,612][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:30:50,123][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:30:50,633][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:30:51,137][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:30:51,642][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:30:52,146][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:30:52,652][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:30:53,160][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:30:53,663][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:30:54,165][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:30:54,681][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:30:55,181][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:30:55,693][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:30:56,194][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:30:56,694][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:30:57,199][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:30:57,701][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:30:58,205][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:30:58,708][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:30:59,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:30:59,711][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:31:00,211][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:31:00,707][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:31:01,208][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:31:01,705][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:31:02,209][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:31:02,709][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:31:03,205][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:31:03,710][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:31:04,206][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:31:04,704][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:31:05,204][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:31:05,701][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:31:06,199][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:31:06,699][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:31:07,199][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:31:07,707][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:31:08,207][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:31:08,705][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:31:09,208][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:31:09,707][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:31:10,205][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:31:10,703][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:31:11,203][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:31:11,714][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:31:12,209][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:31:12,710][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:31:13,207][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:31:13,708][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:31:14,210][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:31:14,712][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:31:15,209][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:31:15,720][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:31:16,220][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:31:16,721][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:31:17,218][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:31:17,719][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:31:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:31:18,722][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:31:19,220][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:31:19,720][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:31:20,221][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:31:20,723][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:31:21,223][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10683 tokens. [2025-11-12 23:31:21,849][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:32 [2025-11-12 23:31:22,615][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:31:22,616][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:31:22,618][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:31:23,550][__main__][INFO] - Iteration 92 took 52s (31.29% Gen, 66.93% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 17m 28s. Estimated total time: 43h 39m 10s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 18s, 500 more iterations: 7h 16m 31s. [2025-11-12 23:31:23,552][__main__][INFO] - Starting iteration 92. [2025-11-12 23:31:24,028][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-12 23:31:24,028][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:31:39,772][__main__][INFO] - Number of regex retries in iteration 92: 0 [2025-11-12 23:31:39,772][__main__][INFO] - agents played in iteration 92 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:31:40,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:31:40,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:31:40,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:31:40,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:31:40,714][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:31:40,714][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:31:41,380][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:31:41,839][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:31:42,348][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:31:42,848][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:31:43,351][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:31:43,852][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:31:44,356][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:31:44,859][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:31:45,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:31:45,871][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:31:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:31:46,904][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:31:47,422][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:31:47,927][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:31:48,433][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:31:48,934][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:31:49,435][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:31:49,939][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:31:50,439][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:31:50,937][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:31:51,440][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:31:51,941][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:31:52,442][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:31:52,946][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:31:53,448][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:31:53,953][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:31:54,454][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:31:54,955][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:31:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:31:56,063][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:31:56,568][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:31:57,070][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:31:57,573][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:31:58,079][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:31:58,576][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:31:59,078][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:31:59,579][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:32:00,079][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:32:00,582][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:32:01,080][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:32:01,579][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:32:02,091][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:32:02,590][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:32:03,095][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:32:03,596][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:32:04,097][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:32:04,600][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:32:05,100][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:32:05,601][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:32:06,102][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:32:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:32:07,124][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:32:07,627][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:32:08,129][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:32:08,629][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:32:09,130][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:32:09,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:32:10,134][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:32:10,638][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:32:11,143][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:32:11,643][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:32:12,145][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:32:12,645][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:32:13,148][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:32:13,651][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10815 tokens. [2025-11-12 23:32:14,304][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:32 [2025-11-12 23:32:15,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:32:15,093][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:32:15,095][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:32:16,011][__main__][INFO] - Iteration 93 took 51s (30.29% Gen, 67.95% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 56m 36s. Estimated total time: 43h 19m 11s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 38s, 500 more iterations: 7h 13m 11s. [2025-11-12 23:32:16,013][__main__][INFO] - Starting iteration 93. [2025-11-12 23:32:16,488][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-12 23:32:16,488][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:32:19,196][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:32:24,171][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:32:31,512][__main__][INFO] - Number of regex retries in iteration 93: 2 [2025-11-12 23:32:31,513][__main__][INFO] - agents played in iteration 93 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:32:32,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:32:32,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:32:32,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:32:32,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:32:32,411][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:32:32,412][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:32:33,094][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:32:33,563][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:32:34,074][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:32:34,579][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:32:35,082][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:32:35,585][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:32:36,093][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:32:36,595][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:32:37,100][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:32:37,606][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:32:38,111][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:32:38,630][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:32:39,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:32:39,629][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:32:40,132][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:32:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:32:41,136][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:32:41,638][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:32:42,140][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:32:42,648][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:32:43,148][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:32:43,651][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:32:44,151][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:32:44,652][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:32:45,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:32:45,659][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:32:46,160][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:32:46,663][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:32:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:32:47,669][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:32:48,170][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:32:48,670][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:32:49,171][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:32:49,674][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:32:50,171][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:32:50,678][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:32:51,181][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:32:51,683][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:32:52,185][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:32:52,684][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:32:53,182][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:32:53,683][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:32:54,184][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:32:54,686][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:32:55,188][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:32:55,688][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:32:56,189][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:32:56,689][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:32:57,191][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:32:57,690][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:32:58,189][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:32:58,691][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:32:59,190][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:32:59,691][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:33:00,190][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:33:00,689][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:33:01,189][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:33:01,689][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:33:02,189][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:33:02,689][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:33:03,188][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:33:03,688][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:33:04,188][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:33:04,688][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:33:05,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10812 tokens. [2025-11-12 23:33:05,864][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:32 [2025-11-12 23:33:06,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:33:06,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:33:06,657][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:33:07,634][__main__][INFO] - Iteration 94 took 51s (29.37% Gen, 68.71% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 13m 54s. Estimated total time: 42h 37m 20s. Time estimates for 10 more iterations: 8m 31s, 100 more iterations: 1h 25m 14s, 500 more iterations: 7h 6m 13s. [2025-11-12 23:33:07,636][__main__][INFO] - Starting iteration 94. [2025-11-12 23:33:08,167][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-12 23:33:08,167][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:33:11,980][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:33:23,785][__main__][INFO] - Number of regex retries in iteration 94: 1 [2025-11-12 23:33:23,785][__main__][INFO] - agents played in iteration 94 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:33:24,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:33:24,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:33:24,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:33:24,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:33:24,800][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:33:24,801][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:33:25,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:33:25,924][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:33:26,433][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:33:26,940][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:33:27,456][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:33:27,961][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:33:28,468][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:33:28,976][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:33:29,488][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:33:29,995][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:33:30,503][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:33:31,005][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:33:31,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:33:32,009][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:33:32,508][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:33:33,008][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:33:33,508][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:33:34,020][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:33:34,522][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:33:35,022][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:33:35,529][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:33:36,025][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:33:36,538][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:33:37,040][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:33:37,539][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:33:38,043][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:33:38,544][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:33:39,044][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:33:39,546][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:33:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:33:40,559][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:33:41,060][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:33:41,562][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:33:42,065][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:33:42,565][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:33:43,066][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:33:43,566][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:33:44,066][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:33:44,567][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:33:45,066][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:33:45,567][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:33:46,066][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:33:46,567][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:33:47,069][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:33:47,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:33:48,068][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:33:48,573][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:33:49,075][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:33:49,576][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:33:50,088][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:33:50,587][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:33:51,087][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:33:51,588][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:33:52,089][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:33:52,590][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:33:53,091][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:33:53,592][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:33:54,093][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:33:54,590][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:33:55,090][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:33:55,590][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:33:56,092][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:33:56,594][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:33:57,097][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:33:57,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10730 tokens. [2025-11-12 23:33:58,287][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:32 [2025-11-12 23:33:59,084][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:33:59,086][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:33:59,088][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:34:00,067][__main__][INFO] - Iteration 95 took 51s (30.09% Gen, 68.02% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 50m 43s. Estimated total time: 43h 15m 1s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 30s, 500 more iterations: 7h 12m 30s. [2025-11-12 23:34:00,070][__main__][INFO] - Starting iteration 95. [2025-11-12 23:34:00,608][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-12 23:34:00,608][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:34:16,725][__main__][INFO] - Number of regex retries in iteration 95: 0 [2025-11-12 23:34:16,726][__main__][INFO] - agents played in iteration 95 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:34:17,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:34:17,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:34:17,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:34:17,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:34:17,666][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:34:17,667][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:34:18,321][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:34:18,789][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:34:19,301][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:34:19,809][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:34:20,314][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:34:20,816][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:34:21,333][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:34:21,841][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:34:22,345][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:34:22,852][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:34:23,355][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:34:23,861][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:34:24,361][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:34:24,862][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:34:25,364][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:34:25,865][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:34:26,367][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:34:26,864][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:34:27,364][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:34:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:34:28,372][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:34:28,874][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:34:29,377][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:34:29,879][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:34:30,381][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:34:30,882][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:34:31,383][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:34:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:34:32,386][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:34:32,887][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:34:33,387][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:34:33,888][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:34:34,391][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:34:34,892][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:34:35,391][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:34:35,891][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:34:36,396][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:34:36,897][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:34:37,397][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:34:37,896][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:34:38,394][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:34:38,892][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:34:39,392][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:34:39,892][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:34:40,391][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:34:40,891][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:34:41,390][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:34:41,889][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:34:42,389][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:34:42,889][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:34:43,389][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:34:43,892][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:34:44,393][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:34:44,893][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:34:45,393][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:34:45,893][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:34:46,399][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:34:46,900][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:34:47,402][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:34:47,903][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:34:48,402][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:34:48,903][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:34:49,402][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:34:49,903][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:34:50,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10790 tokens. [2025-11-12 23:34:51,124][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:32 [2025-11-12 23:34:51,914][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:34:51,915][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:34:51,918][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:34:52,879][__main__][INFO] - Iteration 96 took 52s (30.83% Gen, 67.33% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 8m 23s. Estimated total time: 43h 33m 34s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 35s. [2025-11-12 23:34:52,881][__main__][INFO] - Starting iteration 96. [2025-11-12 23:34:53,365][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-12 23:34:53,365][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:35:07,557][__main__][INFO] - Number of regex retries in iteration 96: 0 [2025-11-12 23:35:07,558][__main__][INFO] - agents played in iteration 96 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:35:08,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:35:08,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:35:08,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:35:08,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:35:08,466][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:35:08,467][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:35:09,173][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:35:09,636][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:35:10,160][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:35:10,665][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:35:11,172][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:35:11,679][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:35:12,184][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:35:12,691][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:35:13,194][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:35:13,693][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:35:14,199][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:35:14,703][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:35:15,208][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:35:15,710][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:35:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:35:16,712][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:35:17,212][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:35:17,713][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:35:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:35:18,722][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:35:19,221][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:35:19,723][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:35:20,223][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:35:20,722][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:35:21,223][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:35:21,723][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:35:22,228][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:35:22,727][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:35:23,229][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:35:23,730][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:35:24,228][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:35:24,726][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:35:25,224][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:35:25,723][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:35:26,227][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:35:26,725][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:35:27,224][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:35:27,723][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:35:28,223][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:35:28,726][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:35:29,226][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:35:29,727][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:35:30,226][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:35:30,727][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:35:31,229][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:35:31,728][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:35:32,229][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:35:32,741][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:35:33,243][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:35:33,751][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:35:34,250][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:35:34,751][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:35:35,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:35:35,770][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:35:36,269][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:35:36,770][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:35:37,274][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:35:37,781][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:35:38,282][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:35:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:35:39,286][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:35:39,789][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:35:40,289][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:35:40,790][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:35:41,292][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10768 tokens. [2025-11-12 23:35:41,967][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.35%, ΔTime: 00:00:32 [2025-11-12 23:35:42,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:35:42,717][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:35:42,719][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:35:43,729][__main__][INFO] - Iteration 97 took 50s (28.18% Gen, 69.81% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 32m 11s. Estimated total time: 41h 58m 13s. Time estimates for 10 more iterations: 8m 23s, 100 more iterations: 1h 23m 56s, 500 more iterations: 6h 59m 42s. [2025-11-12 23:35:43,731][__main__][INFO] - Starting iteration 97. [2025-11-12 23:35:44,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-12 23:35:44,216][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:35:46,969][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:35:58,775][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 1 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:35:59,714][__main__][INFO] - Number of regex retries in iteration 97: 2 [2025-11-12 23:35:59,715][__main__][INFO] - agents played in iteration 97 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:36:00,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:36:00,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:36:00,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:36:00,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:36:00,794][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:36:00,795][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:36:01,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:36:01,942][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:36:02,445][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:36:02,963][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:36:03,461][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:36:03,960][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:36:04,466][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:36:04,967][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:36:05,475][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:36:05,975][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:36:06,476][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:36:06,985][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:36:07,485][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:36:07,985][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:36:08,485][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:36:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:36:09,485][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:36:09,991][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:36:10,494][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:36:10,996][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:36:11,498][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:36:12,002][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:36:12,503][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:36:13,007][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:36:13,510][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:36:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:36:14,519][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:36:15,022][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:36:15,523][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:36:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:36:16,526][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:36:17,026][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:36:17,529][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:36:18,028][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:36:18,527][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:36:19,029][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:36:19,527][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:36:20,025][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:36:20,525][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:36:21,022][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:36:21,530][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:36:22,031][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:36:22,533][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:36:23,032][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:36:23,531][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:36:24,030][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:36:24,532][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:36:25,030][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:36:25,535][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:36:26,035][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:36:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:36:27,035][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:36:27,534][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:36:28,037][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:36:28,540][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:36:29,045][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:36:29,546][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:36:30,045][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:36:30,559][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:36:31,060][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:36:31,558][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:36:32,062][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:36:32,561][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:36:33,070][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:36:33,568][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10802 tokens. [2025-11-12 23:36:34,263][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:32 [2025-11-12 23:36:35,041][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:36:35,043][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:36:35,044][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:36:36,037][__main__][INFO] - Iteration 98 took 51s (29.91% Gen, 68.17% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 44m 12s. Estimated total time: 43h 11m 7s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 22s, 500 more iterations: 7h 11m 51s. [2025-11-12 23:36:36,039][__main__][INFO] - Starting iteration 98. [2025-11-12 23:36:36,615][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-12 23:36:36,616][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:36:43,833][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:36:52,839][__main__][INFO] - Number of regex retries in iteration 98: 1 [2025-11-12 23:36:52,840][__main__][INFO] - agents played in iteration 98 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:36:53,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:36:53,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:36:53,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:36:53,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:36:53,851][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:36:53,852][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:36:54,527][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:36:54,988][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:36:55,491][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:36:55,994][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:36:56,496][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:36:56,999][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:36:57,505][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:36:58,007][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:36:58,510][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:36:59,018][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:36:59,526][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:37:00,029][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:37:00,527][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:37:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:37:01,529][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:37:02,028][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:37:02,529][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:37:03,029][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:37:03,528][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:37:04,031][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:37:04,531][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:37:05,031][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:37:05,531][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:37:06,034][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:37:06,534][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:37:07,034][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:37:07,538][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:37:08,041][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:37:08,542][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:37:09,043][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:37:09,546][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:37:10,047][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:37:10,549][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:37:11,046][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:37:11,543][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:37:12,049][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:37:12,546][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:37:13,044][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:37:13,543][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:37:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:37:14,542][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:37:15,045][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:37:15,550][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:37:16,063][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:37:16,563][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:37:17,069][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:37:17,569][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:37:18,068][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:37:18,583][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:37:19,081][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:37:19,579][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:37:20,081][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:37:20,580][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:37:21,087][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:37:21,584][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:37:22,083][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:37:22,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:37:23,125][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:37:23,628][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:37:24,130][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:37:24,632][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:37:25,139][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:37:25,646][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:37:26,153][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:37:26,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10827 tokens. [2025-11-12 23:37:27,375][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:32 [2025-11-12 23:37:28,139][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:37:28,140][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:37:28,142][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:37:29,009][__main__][INFO] - Iteration 99 took 52s (30.96% Gen, 67.38% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 11m 56s. Estimated total time: 43h 39m 43s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 19s, 500 more iterations: 7h 16m 37s. [2025-11-12 23:37:29,012][__main__][INFO] - Starting iteration 99. [2025-11-12 23:37:29,527][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-12 23:37:29,527][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:37:44,512][__main__][INFO] - Number of regex retries in iteration 99: 0 [2025-11-12 23:37:44,512][__main__][INFO] - agents played in iteration 99 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:37:45,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:37:45,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:37:45,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:37:45,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:37:45,456][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:37:45,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:37:46,125][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:37:46,589][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:37:47,102][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:37:47,606][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:37:48,110][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:37:48,616][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:37:49,122][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:37:49,624][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:37:50,126][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:37:50,630][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:37:51,137][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:37:51,641][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:37:52,140][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:37:52,642][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:37:53,144][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:37:53,647][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:37:54,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:37:54,660][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:37:55,161][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:37:55,664][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:37:56,166][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:37:56,683][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:37:57,184][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:37:57,683][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:37:58,185][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:37:58,684][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:37:59,194][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:37:59,694][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:38:00,193][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:38:00,693][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:38:01,194][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:38:01,695][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:38:02,194][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:38:02,695][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:38:03,203][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:38:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:38:04,205][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:38:04,705][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:38:05,206][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:38:05,714][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:38:06,213][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:38:06,712][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:38:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:38:07,718][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:38:08,219][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:38:08,719][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:38:09,219][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:38:09,722][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:38:10,222][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:38:10,720][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:38:11,220][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:38:11,718][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:38:12,218][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:38:12,719][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:38:13,217][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:38:13,720][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:38:14,220][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:38:14,721][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:38:15,222][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:38:15,720][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:38:16,225][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:38:16,726][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:38:17,229][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:38:17,735][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:38:18,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10844 tokens. [2025-11-12 23:38:18,973][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:32 [2025-11-12 23:38:19,721][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:38:19,723][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:38:19,724][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:38:20,624][__main__][INFO] - Iteration 100 took 51s (29.33% Gen, 68.91% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 6m 15s. Estimated total time: 42h 34m 54s. Time estimates for 10 more iterations: 8m 30s, 100 more iterations: 1h 25m 9s, 500 more iterations: 7h 5m 49s. [2025-11-12 23:38:20,626][__main__][INFO] - Starting iteration 100. [2025-11-12 23:38:21,109][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 9 and human policies 1. 
[2025-11-12 23:38:21,110][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:38:35,533][__main__][INFO] - Number of regex retries in iteration 100: 0 [2025-11-12 23:38:35,534][__main__][INFO] - agents played in iteration 100 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:38:36,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:38:36,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:38:36,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:38:36,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:38:36,397][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:38:36,398][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:38:37,035][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:38:37,514][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:38:38,027][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:38:38,534][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:38:39,040][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:38:39,545][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:38:40,051][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:38:40,555][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:38:41,060][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:38:41,565][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:38:42,070][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:38:42,573][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:38:43,085][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:38:43,586][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:38:44,097][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:38:44,599][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:38:45,102][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:38:45,603][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:38:46,103][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:38:46,604][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:38:47,103][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:38:47,604][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:38:48,121][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:38:48,621][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:38:49,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:38:49,624][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:38:50,125][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:38:50,629][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:38:51,130][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:38:51,632][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:38:52,137][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:38:52,640][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:38:53,142][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:38:53,641][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:38:54,141][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:38:54,640][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:38:55,141][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:38:55,641][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:38:56,143][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:38:56,640][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:38:57,141][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:38:57,644][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:38:58,145][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:38:58,647][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:38:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:38:59,648][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:39:00,146][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:39:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:39:01,150][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:39:01,650][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:39:02,150][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:39:02,651][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:39:03,150][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:39:03,651][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:39:04,150][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:39:04,651][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:39:05,153][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:39:05,654][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:39:06,152][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:39:06,653][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:39:07,155][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:39:07,663][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:39:08,168][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:39:08,672][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:39:09,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10728 tokens. [2025-11-12 23:39:09,833][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:32 [2025-11-12 23:39:10,588][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:39:10,590][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:39:10,591][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:39:12,430][__main__][INFO] - Iteration 101 took 51s (28.10% Gen, 68.31% Train). Generation: 14s, Training: 35s. Estimated remaining time: 41h 16m 32s. Estimated total time: 42h 46m 3s. Time estimates for 10 more iterations: 8m 33s, 100 more iterations: 1h 25m 32s, 500 more iterations: 7h 7m 40s. [2025-11-12 23:39:12,433][__main__][INFO] - Starting iteration 101. [2025-11-12 23:39:12,922][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-12 23:39:12,923][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:39:28,363][__main__][INFO] - Number of regex retries in iteration 101: 0 [2025-11-12 23:39:28,364][__main__][INFO] - agents played in iteration 101 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:39:29,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:39:29,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:39:29,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:39:29,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:39:29,226][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:39:29,226][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:39:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:39:30,362][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:39:30,869][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:39:31,370][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:39:31,874][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:39:32,381][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:39:32,897][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:39:33,394][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:39:33,897][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:39:34,398][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:39:34,901][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:39:35,405][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:39:35,905][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:39:36,408][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:39:36,910][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:39:37,416][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:39:37,932][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:39:38,436][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:39:38,939][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:39:39,442][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:39:39,943][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:39:40,445][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:39:40,945][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:39:41,446][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:39:41,949][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:39:42,451][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:39:42,951][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:39:43,452][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:39:43,973][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:39:44,479][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:39:44,982][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:39:45,483][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:39:45,996][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:39:46,503][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:39:47,014][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:39:47,516][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:39:48,020][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:39:48,530][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:39:49,031][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:39:49,533][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:39:50,036][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:39:50,537][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:39:51,035][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:39:51,537][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:39:52,035][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:39:52,540][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:39:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:39:53,541][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:39:54,042][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:39:54,544][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:39:55,046][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:39:55,546][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:39:56,047][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:39:56,550][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:39:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:39:57,553][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:39:58,059][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:39:58,564][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:39:59,067][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:39:59,570][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:40:00,073][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:40:00,579][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:40:01,082][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:40:01,593][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:40:02,107][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10825 tokens. [2025-11-12 23:40:02,806][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:32 [2025-11-12 23:40:03,551][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:40:03,553][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:40:03,555][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:40:04,478][__main__][INFO] - Iteration 102 took 51s (29.95% Gen, 68.26% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 27m 28s. Estimated total time: 42h 57m 50s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 55s, 500 more iterations: 7h 9m 38s. [2025-11-12 23:40:04,480][__main__][INFO] - Starting iteration 102. [2025-11-12 23:40:04,966][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-12 23:40:04,967][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:40:20,424][__main__][INFO] - Number of regex retries in iteration 102: 0 [2025-11-12 23:40:20,425][__main__][INFO] - agents played in iteration 102 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:40:21,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:40:21,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:40:21,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:40:21,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:40:21,325][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:40:21,325][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:40:21,962][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:40:22,423][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:40:22,939][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:40:23,448][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:40:23,953][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:40:24,462][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:40:24,964][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:40:25,468][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:40:25,971][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:40:26,473][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:40:26,977][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:40:27,480][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:40:27,982][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:40:28,486][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:40:28,989][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:40:29,493][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:40:29,997][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:40:30,500][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:40:31,003][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:40:31,504][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:40:32,005][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:40:32,504][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:40:33,004][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:40:33,516][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:40:34,014][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:40:34,515][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:40:35,025][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:40:35,527][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:40:36,039][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:40:36,543][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:40:37,049][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:40:37,554][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:40:38,055][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:40:38,555][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:40:39,060][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:40:39,561][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:40:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:40:40,567][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:40:41,068][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:40:41,569][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:40:42,071][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:40:42,571][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:40:43,071][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:40:43,571][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:40:44,078][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:40:44,582][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:40:45,089][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:40:45,593][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:40:46,094][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:40:46,595][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:40:47,094][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:40:47,596][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:40:48,099][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:40:48,599][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:40:49,097][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:40:49,596][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:40:50,101][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:40:50,605][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:40:51,108][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:40:51,611][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:40:52,115][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:40:52,623][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:40:53,140][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:40:53,644][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:40:54,149][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10848 tokens. [2025-11-12 23:40:54,810][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:32 [2025-11-12 23:40:55,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:40:55,581][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:40:55,583][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:40:56,528][__main__][INFO] - Iteration 103 took 51s (29.98% Gen, 68.18% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 26m 54s. Estimated total time: 42h 58m 9s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 56s, 500 more iterations: 7h 9m 41s. [2025-11-12 23:40:56,530][__main__][INFO] - Starting iteration 103. [2025-11-12 23:40:57,017][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-12 23:40:57,018][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:41:05,566][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:41:11,678][__main__][INFO] - Number of regex retries in iteration 103: 1 [2025-11-12 23:41:11,679][__main__][INFO] - agents played in iteration 103 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:41:12,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:41:12,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:41:12,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:41:12,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:41:12,587][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:41:12,588][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:41:13,235][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:41:13,695][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:41:14,206][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:41:14,709][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:41:15,215][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:41:15,720][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:41:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:41:16,729][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:41:17,231][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:41:17,737][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:41:18,253][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:41:18,755][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:41:19,261][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:41:19,762][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:41:20,263][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:41:20,769][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:41:21,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:41:21,775][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:41:22,282][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:41:22,780][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:41:23,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:41:23,781][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:41:24,280][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:41:24,782][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:41:25,282][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:41:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:41:26,283][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:41:26,784][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:41:27,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:41:27,787][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:41:28,289][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:41:28,791][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:41:29,290][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:41:29,789][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:41:30,288][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:41:30,788][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:41:31,289][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:41:31,789][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:41:32,289][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:41:32,789][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:41:33,288][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:41:33,787][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:41:34,290][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:41:34,791][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:41:35,293][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:41:35,793][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:41:36,292][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:41:36,792][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:41:37,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:41:37,794][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:41:38,297][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:41:38,797][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:41:39,298][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:41:39,799][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:41:40,304][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:41:40,806][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:41:41,310][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:41:41,812][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:41:42,316][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:41:42,819][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:41:43,324][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:41:43,828][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:41:44,346][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:41:44,848][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:41:45,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10820 tokens. [2025-11-12 23:41:46,035][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:32 [2025-11-12 23:41:46,807][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:41:46,808][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:41:46,810][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:41:47,751][__main__][INFO] - Iteration 104 took 50s (28.90% Gen, 69.25% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 44m 36s. Estimated total time: 42h 16m 43s. Time estimates for 10 more iterations: 8m 27s, 100 more iterations: 1h 24m 33s, 500 more iterations: 7h 2m 47s. [2025-11-12 23:41:47,753][__main__][INFO] - Starting iteration 104. [2025-11-12 23:41:48,229][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-12 23:41:48,230][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:41:50,848][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:41:58,598][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 1 z_ball did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:42:01,829][__main__][INFO] - Number of regex retries in iteration 104: 2 [2025-11-12 23:42:01,830][__main__][INFO] - agents played in iteration 104 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:42:02,616][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:42:02,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:42:02,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:42:02,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:42:02,696][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:42:02,698][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:42:03,325][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:42:03,782][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:42:04,288][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:42:04,789][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:42:05,291][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:42:05,794][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:42:06,295][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:42:06,797][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:42:07,299][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:42:07,808][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:42:08,316][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:42:08,820][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:42:09,322][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:42:09,833][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:42:10,340][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:42:10,860][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:42:11,360][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:42:11,860][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:42:12,363][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:42:12,868][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:42:13,377][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:42:13,879][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:42:14,380][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:42:14,882][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:42:15,384][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:42:15,884][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:42:16,382][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:42:16,883][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:42:17,383][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:42:17,884][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:42:18,383][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:42:18,885][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:42:19,385][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:42:19,884][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:42:20,381][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:42:20,880][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:42:21,380][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:42:21,876][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:42:22,374][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:42:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:42:23,374][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:42:23,871][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:42:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:42:24,872][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:42:25,372][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:42:25,871][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:42:26,371][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:42:26,875][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:42:27,376][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:42:27,875][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:42:28,377][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:42:28,877][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:42:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:42:29,887][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:42:30,392][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:42:30,897][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:42:31,400][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:42:31,903][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:42:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:42:32,910][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:42:33,416][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:42:33,920][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:42:34,424][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:42:34,932][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:42:35,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10852 tokens. [2025-11-12 23:42:36,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:32 [2025-11-12 23:42:36,838][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:42:36,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:42:36,842][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:42:37,753][__main__][INFO] - Iteration 105 took 49s (27.46% Gen, 70.70% Train). Generation: 13s, Training: 35s. Estimated remaining time: 39h 43m 17s. Estimated total time: 41h 16m 13s. Time estimates for 10 more iterations: 8m 15s, 100 more iterations: 1h 22m 32s, 500 more iterations: 6h 52m 42s. [2025-11-12 23:42:37,755][__main__][INFO] - Starting iteration 105. [2025-11-12 23:42:38,253][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-12 23:42:38,254][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:42:52,453][__main__][INFO] - Number of regex retries in iteration 105: 0 [2025-11-12 23:42:52,453][__main__][INFO] - agents played in iteration 105 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:42:53,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:42:53,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:42:53,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:42:53,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:42:53,351][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:42:53,352][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:42:53,989][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:42:54,456][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:42:54,965][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:42:55,468][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:42:55,974][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:42:56,477][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:42:56,982][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:42:57,486][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:42:57,990][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:42:58,494][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:42:58,997][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:42:59,501][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:43:00,006][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:43:00,513][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:43:01,021][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:43:01,525][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:43:02,024][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:43:02,529][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:43:03,030][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:43:03,537][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:43:04,053][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:43:04,555][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:43:05,068][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:43:05,572][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:43:06,095][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:43:06,597][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:43:07,100][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:43:07,604][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:43:08,105][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:43:08,605][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:43:09,107][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:43:09,608][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:43:10,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:43:10,608][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:43:11,106][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:43:11,608][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:43:12,108][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:43:12,609][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:43:13,110][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:43:13,609][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:43:14,110][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:43:14,609][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:43:15,109][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:43:15,610][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:43:16,110][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:43:16,610][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:43:17,110][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:43:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:43:18,111][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:43:18,611][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:43:19,112][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:43:19,614][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:43:20,117][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:43:20,619][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:43:21,124][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:43:21,628][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:43:22,134][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:43:22,641][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:43:23,145][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:43:23,655][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:43:24,159][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:43:24,673][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:43:25,177][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:43:25,680][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:43:26,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10858 tokens. [2025-11-12 23:43:26,825][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:32 [2025-11-12 23:43:27,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:43:27,636][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:43:27,638][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:43:28,602][__main__][INFO] - Iteration 106 took 50s (28.20% Gen, 69.88% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 23m 42s. Estimated total time: 41h 57m 29s. Time estimates for 10 more iterations: 8m 23s, 100 more iterations: 1h 23m 54s, 500 more iterations: 6h 59m 34s. [2025-11-12 23:43:28,605][__main__][INFO] - Starting iteration 106. [2025-11-12 23:43:29,099][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-12 23:43:29,100][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:43:44,928][__main__][INFO] - Number of regex retries in iteration 106: 0 [2025-11-12 23:43:44,928][__main__][INFO] - agents played in iteration 106 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:43:45,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:43:45,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:43:45,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:43:45,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:43:45,835][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:43:45,836][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:43:46,471][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:43:46,937][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:43:47,446][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:43:47,949][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:43:48,454][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:43:48,956][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:43:49,458][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:43:49,965][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:43:50,468][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:43:50,988][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:43:51,489][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:43:51,990][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:43:52,496][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:43:53,002][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:43:53,509][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:43:54,018][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:43:54,522][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:43:55,029][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:43:55,531][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:43:56,033][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:43:56,536][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:43:57,040][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:43:57,549][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:43:58,052][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:43:58,553][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:43:59,059][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:43:59,563][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:44:00,080][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:44:00,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:44:01,084][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:44:01,588][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:44:02,113][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:44:02,621][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:44:03,121][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:44:03,623][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:44:04,128][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:44:04,629][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:44:05,130][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:44:05,629][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:44:06,131][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:44:06,636][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:44:07,139][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:44:07,645][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:44:08,148][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:44:08,650][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:44:09,154][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:44:09,656][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:44:10,157][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:44:10,673][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:44:11,174][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:44:11,683][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:44:12,184][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:44:12,690][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:44:13,196][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:44:13,701][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:44:14,207][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:44:14,708][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:44:15,221][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:44:15,733][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:44:16,237][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:44:16,742][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:44:17,245][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:44:17,748][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:44:18,252][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:44:18,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10852 tokens. [2025-11-12 23:44:19,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.30%, Current % of VRAM taken: 58.55%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:32 [2025-11-12 23:44:20,171][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:44:20,173][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:44:20,174][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:44:21,095][__main__][INFO] - Iteration 107 took 51s (30.44% Gen, 67.79% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 45m 9s. Estimated total time: 43h 19m 48s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 39s, 500 more iterations: 7h 13m 18s. [2025-11-12 23:44:21,097][__main__][INFO] - Starting iteration 107. [2025-11-12 23:44:21,566][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-12 23:44:21,567][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:44:36,110][__main__][INFO] - Number of regex retries in iteration 107: 0 [2025-11-12 23:44:36,110][__main__][INFO] - agents played in iteration 107 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:44:36,894][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:44:36,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:44:36,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:44:36,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:44:36,960][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:44:36,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:44:37,624][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:44:38,088][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:44:38,595][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:44:39,097][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:44:39,599][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:44:40,098][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:44:40,610][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:44:41,111][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:44:41,613][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:44:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:44:42,624][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:44:43,129][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:44:43,632][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:44:44,138][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:44:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:44:45,144][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:44:45,647][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:44:46,150][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:44:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:44:47,158][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:44:47,665][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:44:48,168][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:44:48,683][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:44:49,188][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:44:49,699][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:44:50,233][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:44:50,737][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:44:51,241][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:44:51,744][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:44:52,246][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:44:52,749][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:44:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:44:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:44:54,258][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:44:54,760][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:44:55,261][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:44:55,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:44:56,276][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:44:56,777][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:44:57,279][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:44:57,788][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:44:58,290][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:44:58,792][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:44:59,294][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:44:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:45:00,310][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:45:00,812][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:45:01,317][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:45:01,823][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:45:02,326][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:45:02,833][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:45:03,337][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:45:03,836][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:45:04,345][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:45:04,850][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:45:05,353][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:45:05,856][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:45:06,359][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:45:06,863][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:45:07,364][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:45:07,867][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:45:08,380][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:45:08,882][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:45:09,394][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:45:09,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10801 tokens. [2025-11-12 23:45:10,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:32 [2025-11-12 23:45:11,329][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:45:11,330][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:45:11,332][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:45:12,301][__main__][INFO] - Iteration 108 took 50s (28.67% Gen, 69.42% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 41m 14s. Estimated total time: 42h 16m 45s. Time estimates for 10 more iterations: 8m 27s, 100 more iterations: 1h 24m 33s, 500 more iterations: 7h 2m 47s. [2025-11-12 23:45:12,303][__main__][INFO] - Starting iteration 108. [2025-11-12 23:45:12,789][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-12 23:45:12,790][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:45:22,055][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:45:28,096][__main__][INFO] - Number of regex retries in iteration 108: 1 [2025-11-12 23:45:28,096][__main__][INFO] - agents played in iteration 108 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:45:28,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:45:28,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:45:28,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:45:28,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:45:28,967][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:45:28,969][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:45:29,584][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:45:30,043][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:45:30,554][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:45:31,055][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:45:31,560][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:45:32,062][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:45:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:45:33,069][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:45:33,572][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:45:34,072][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:45:34,588][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:45:35,090][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:45:35,599][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:45:36,099][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:45:36,599][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:45:37,102][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:45:37,604][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:45:38,109][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:45:38,614][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:45:39,118][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:45:39,629][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:45:40,131][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:45:40,635][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:45:41,137][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:45:41,636][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:45:42,139][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:45:42,639][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:45:43,138][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:45:43,641][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:45:44,142][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:45:44,639][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:45:45,138][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:45:45,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:45:46,153][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:45:46,656][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:45:47,157][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:45:47,659][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:45:48,160][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:45:48,663][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:45:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:45:49,663][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:45:50,164][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:45:50,666][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:45:51,166][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:45:51,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:45:52,172][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:45:52,682][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:45:53,189][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:45:53,692][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:45:54,196][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:45:54,698][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:45:55,211][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:45:55,711][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:45:56,212][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:45:56,714][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:45:57,212][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:45:57,709][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:45:58,212][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:45:58,710][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:45:59,217][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:45:59,719][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:46:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:46:00,730][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:46:01,229][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:46:01,731][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10858 tokens. [2025-11-12 23:46:02,372][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:32 [2025-11-12 23:46:03,139][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:46:03,141][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:46:03,142][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:46:04,096][__main__][INFO] - Iteration 109 took 51s (29.83% Gen, 68.31% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 8m 58s. Estimated total time: 42h 45m 20s. Time estimates for 10 more iterations: 8m 33s, 100 more iterations: 1h 25m 30s, 500 more iterations: 7h 7m 33s. [2025-11-12 23:46:04,098][__main__][INFO] - Starting iteration 109. [2025-11-12 23:46:04,592][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. [2025-11-12 23:46:04,593][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:46:15,797][mllm.models.large_language_model_local][WARNING] - Response Proposal: x hats, y books, z balls Given the values: - You value hats and books at 1. - You value balls at 10. - Bob values hats and books at 10. - Bob values balls at 1. 
The best strategy here is to maximize the value from items you highly value, which are the balls. Since Bob values balls very low, he is unlikely to propose taking many balls, while valuing your hats and books more. This means you should allocate as many balls as possible to yourself. Thus, I propose: Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:46:19,299][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:46:22,153][__main__][INFO] - Number of regex retries in iteration 109: 2 [2025-11-12 23:46:22,153][__main__][INFO] - agents played in iteration 109 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:46:23,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:46:23,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:46:23,089][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:46:23,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:46:23,112][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. 
[2025-11-12 23:46:23,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 23:46:23,740][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:46:24,217][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:46:24,724][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:46:25,232][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:46:25,739][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:46:26,247][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:46:26,758][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:46:27,263][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:46:27,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:46:28,284][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:46:28,789][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:46:29,299][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:46:29,801][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:46:30,304][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:46:30,809][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:46:31,314][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:46:31,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:46:32,321][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:46:32,824][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:46:33,337][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 
23:46:33,840][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 23:46:34,342][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:46:34,846][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:46:35,346][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:46:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:46:36,351][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:46:36,857][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:46:37,361][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:46:37,865][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:46:38,366][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:46:38,868][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:46:39,370][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:46:39,873][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:46:40,371][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:46:40,872][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:46:41,372][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:46:41,873][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:46:42,378][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:46:42,882][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:46:43,387][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:46:43,894][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 
23:46:44,397][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 23:46:44,901][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:46:45,413][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:46:45,912][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:46:46,424][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:46:46,927][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:46:47,430][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:46:47,939][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:46:48,442][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:46:48,945][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:46:49,449][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:46:49,951][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:46:50,456][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:46:50,959][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:46:51,462][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:46:51,964][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:46:52,469][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:46:52,968][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:46:53,472][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:46:53,977][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:46:54,482][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 
23:46:54,983][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 23:46:55,486][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:46:55,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10811 tokens. [2025-11-12 23:46:56,671][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.39%, ΔTime: 00:00:32 [2025-11-12 23:46:57,449][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:46:57,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:46:57,452][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:46:58,361][__main__][INFO] - Iteration 110 took 53s (32.66% Gen, 65.65% Train). Generation: 17s, Training: 35s. Estimated remaining time: 43h 11m 9s. Estimated total time: 44h 48m 26s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 36s, 500 more iterations: 7h 28m 4s. [2025-11-12 23:46:58,363][__main__][INFO] - Starting iteration 110. [2025-11-12 23:46:58,853][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 10 and human policies 1. 
[2025-11-12 23:46:58,854][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:47:14,919][__main__][INFO] - Number of regex retries in iteration 110: 0 [2025-11-12 23:47:14,919][__main__][INFO] - agents played in iteration 110 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:47:15,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:47:15,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:47:15,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:47:15,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:47:15,829][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:47:15,830][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:47:16,477][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:47:16,944][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:47:17,454][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:47:17,962][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:47:18,471][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:47:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:47:19,484][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:47:19,990][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:47:20,494][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:47:21,000][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:47:21,514][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:47:22,051][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:47:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:47:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:47:23,563][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:47:24,065][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:47:24,568][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:47:25,071][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:47:25,578][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:47:26,085][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:47:26,588][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:47:27,090][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:47:27,593][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:47:28,100][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:47:28,618][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:47:29,120][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:47:29,621][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:47:30,123][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:47:30,626][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:47:31,142][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:47:31,642][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:47:32,147][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:47:32,646][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:47:33,148][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:47:33,654][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:47:34,162][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:47:34,667][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:47:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:47:35,679][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:47:36,180][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:47:36,685][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:47:37,188][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:47:37,693][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:47:38,197][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:47:38,700][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:47:39,214][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:47:39,717][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:47:40,233][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:47:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:47:41,235][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:47:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:47:42,241][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:47:42,746][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:47:43,248][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:47:43,751][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:47:44,252][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:47:44,755][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:47:45,256][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:47:45,758][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:47:46,261][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:47:46,763][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:47:47,265][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:47:47,764][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:47:48,268][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:47:48,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10854 tokens. [2025-11-12 23:47:49,420][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.50%, ΔTime: 00:00:32 [2025-11-12 23:47:50,188][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:47:50,190][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:47:50,191][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:47:51,996][__main__][INFO] - Iteration 111 took 53s (30.23% Gen, 66.37% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 38m 59s. Estimated total time: 44h 17m 10s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 34s, 500 more iterations: 7h 22m 51s. [2025-11-12 23:47:51,998][__main__][INFO] - Starting iteration 111. [2025-11-12 23:47:52,472][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2025-11-12 23:47:52,473][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:48:07,166][__main__][INFO] - Number of regex retries in iteration 111: 0 [2025-11-12 23:48:07,167][__main__][INFO] - agents played in iteration 111 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:48:07,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:48:07,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:48:08,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:48:08,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:48:08,030][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:48:08,031][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:48:08,677][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:48:09,140][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:48:09,648][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:48:10,160][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:48:10,663][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:48:11,165][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:48:11,668][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:48:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:48:12,674][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:48:13,185][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:48:13,688][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:48:14,196][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:48:14,702][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:48:15,204][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:48:15,716][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:48:16,222][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:48:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:48:17,247][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:48:17,749][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:48:18,255][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:48:18,757][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:48:19,260][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:48:19,760][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:48:20,262][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:48:20,767][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:48:21,267][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:48:21,768][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:48:22,270][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:48:22,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:48:23,272][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:48:23,771][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:48:24,271][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:48:24,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:48:25,274][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:48:25,775][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:48:26,275][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:48:26,779][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:48:27,281][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:48:27,785][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:48:28,291][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:48:28,795][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:48:29,296][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:48:29,800][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:48:30,304][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:48:30,806][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:48:31,310][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:48:31,813][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:48:32,317][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:48:32,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:48:33,325][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:48:33,829][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:48:34,333][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:48:34,840][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:48:35,345][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:48:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:48:36,366][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:48:36,868][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:48:37,370][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:48:37,872][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:48:38,375][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:48:38,880][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:48:39,383][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:48:39,886][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:48:40,389][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:48:40,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10821 tokens. [2025-11-12 23:48:41,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:32 [2025-11-12 23:48:42,330][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:48:42,332][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:48:42,333][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:48:43,270][__main__][INFO] - Iteration 112 took 50s (28.92% Gen, 69.23% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 40m 56s. Estimated total time: 42h 19m 57s. Time estimates for 10 more iterations: 8m 27s, 100 more iterations: 1h 24m 39s, 500 more iterations: 7h 3m 19s. [2025-11-12 23:48:43,272][__main__][INFO] - Starting iteration 112. [2025-11-12 23:48:43,763][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2025-11-12 23:48:43,764][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:48:59,151][__main__][INFO] - Number of regex retries in iteration 112: 0 [2025-11-12 23:48:59,151][__main__][INFO] - agents played in iteration 112 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:48:59,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:48:59,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:49:00,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:49:00,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:49:00,023][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:49:00,024][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:49:00,655][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:49:01,122][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:49:01,631][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:49:02,134][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:49:02,637][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:49:03,144][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:49:03,659][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:49:04,161][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:49:04,663][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:49:05,173][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:49:05,678][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:49:06,184][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:49:06,688][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:49:07,192][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:49:07,702][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:49:08,204][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:49:08,705][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:49:09,208][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:49:09,709][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:49:10,211][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:49:10,713][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:49:11,213][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:49:11,714][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:49:12,215][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:49:12,720][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:49:13,219][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:49:13,720][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:49:14,221][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:49:14,722][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:49:15,223][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:49:15,725][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:49:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:49:16,728][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:49:17,233][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:49:17,733][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:49:18,236][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:49:18,737][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:49:19,239][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:49:19,756][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:49:20,262][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:49:20,775][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:49:21,282][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:49:21,789][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:49:22,302][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:49:22,825][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:49:23,328][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:49:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:49:24,342][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:49:24,849][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:49:25,349][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:49:25,852][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:49:26,355][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:49:26,860][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:49:27,379][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:49:27,883][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:49:28,386][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:49:28,895][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:49:29,398][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:49:29,899][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:49:30,403][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:49:30,904][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:49:31,412][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:49:31,915][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:49:32,420][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:49:32,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10846 tokens. [2025-11-12 23:49:33,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:32 [2025-11-12 23:49:34,376][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:49:34,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:49:34,380][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:49:35,321][__main__][INFO] - Iteration 113 took 51s (29.85% Gen, 68.33% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 18m 2s. Estimated total time: 42h 57m 56s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 55s, 500 more iterations: 7h 9m 39s. [2025-11-12 23:49:35,323][__main__][INFO] - Starting iteration 113. [2025-11-12 23:49:35,823][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2025-11-12 23:49:35,824][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:49:51,904][__main__][INFO] - Number of regex retries in iteration 113: 0 [2025-11-12 23:49:51,904][__main__][INFO] - agents played in iteration 113 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:49:52,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:49:52,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:49:52,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:49:52,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:49:52,803][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:49:52,804][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:49:53,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:49:53,891][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:49:54,397][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:49:54,903][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:49:55,408][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:49:55,915][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:49:56,421][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:49:56,923][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:49:57,425][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:49:57,928][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:49:58,434][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:49:58,944][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:49:59,449][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:49:59,950][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:50:00,463][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:50:00,962][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:50:01,470][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:50:01,974][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:50:02,473][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:50:02,976][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:50:03,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:50:03,980][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:50:04,481][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:50:04,981][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:50:05,504][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:50:06,024][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:50:06,530][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:50:07,032][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:50:07,533][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:50:08,034][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:50:08,535][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:50:09,037][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:50:09,542][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:50:10,043][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:50:10,548][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:50:11,053][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:50:11,556][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:50:12,060][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:50:12,562][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:50:13,067][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:50:13,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:50:14,078][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:50:14,581][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:50:15,089][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:50:15,591][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:50:16,109][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:50:16,612][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:50:17,110][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:50:17,615][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:50:18,117][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:50:18,620][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:50:19,119][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:50:19,623][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:50:20,127][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:50:20,630][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:50:21,131][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:50:21,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:50:22,135][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:50:22,639][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:50:23,141][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:50:23,643][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:50:24,145][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:50:24,648][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:50:25,151][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:50:25,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10804 tokens. [2025-11-12 23:50:26,339][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:32 [2025-11-12 23:50:27,103][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:50:27,104][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:50:27,106][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:50:28,030][__main__][INFO] - Iteration 114 took 52s (30.80% Gen, 67.43% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 49m 37s. Estimated total time: 43h 30m 23s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 0s, 500 more iterations: 7h 15m 3s. [2025-11-12 23:50:28,032][__main__][INFO] - Starting iteration 114. [2025-11-12 23:50:28,502][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2025-11-12 23:50:28,503][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:50:31,445][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:50:32,384][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:50:43,861][__main__][INFO] - Number of regex retries in iteration 114: 2 [2025-11-12 23:50:43,862][__main__][INFO] - agents played in iteration 114 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:50:44,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:50:44,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:50:44,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:50:44,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:50:44,777][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:50:44,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:50:45,405][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:50:45,865][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:50:46,380][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:50:46,885][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:50:47,389][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:50:47,900][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:50:48,406][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:50:48,920][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:50:49,426][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:50:49,936][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:50:50,444][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:50:50,950][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:50:51,457][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:50:51,961][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:50:52,470][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:50:52,978][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:50:53,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:50:53,986][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:50:54,487][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:50:54,988][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:50:55,499][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:50:56,000][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:50:56,499][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:50:57,006][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:50:57,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:50:58,021][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:50:58,523][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:50:59,026][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:50:59,539][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:51:00,043][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:51:00,548][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:51:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:51:01,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:51:02,063][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:51:02,566][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:51:03,069][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:51:03,574][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:51:04,077][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:51:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:51:05,085][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:51:05,587][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:51:06,090][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:51:06,589][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:51:07,092][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:51:07,594][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:51:08,097][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:51:08,605][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:51:09,107][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:51:09,606][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:51:10,115][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:51:10,617][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:51:11,128][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:51:11,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:51:12,128][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:51:12,629][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:51:13,133][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:51:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:51:14,145][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:51:14,649][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:51:15,154][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:51:15,658][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:51:16,164][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:51:16,673][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:51:17,183][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:51:17,691][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10857 tokens. [2025-11-12 23:51:18,392][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.29%, Current % of VRAM taken: 58.54%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:32 [2025-11-12 23:51:19,181][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:51:19,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:51:19,184][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:51:20,116][__main__][INFO] - Iteration 115 took 51s (29.76% Gen, 68.44% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 19m 5s. Estimated total time: 43h 0m 43s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 1s, 500 more iterations: 7h 10m 7s. [2025-11-12 23:51:20,118][__main__][INFO] - Starting iteration 115. [2025-11-12 23:51:20,607][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2025-11-12 23:51:20,608][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:51:29,514][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:51:34,625][__main__][INFO] - Number of regex retries in iteration 115: 1 [2025-11-12 23:51:34,626][__main__][INFO] - agents played in iteration 115 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:51:35,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:51:35,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:51:35,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:51:35,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:51:35,609][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:51:35,610][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:51:36,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:51:36,718][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:51:37,235][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:51:37,735][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:51:38,247][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:51:38,754][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:51:39,258][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:51:39,764][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:51:40,270][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:51:40,776][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:51:41,287][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:51:41,791][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:51:42,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:51:42,799][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:51:43,309][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:51:43,828][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:51:44,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:51:44,827][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:51:45,329][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:51:45,830][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:51:46,338][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:51:46,836][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:51:47,338][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:51:47,845][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:51:48,347][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:51:48,883][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:51:49,388][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:51:49,890][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:51:50,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:51:50,894][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:51:51,397][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:51:51,904][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:51:52,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:51:52,911][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:51:53,416][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:51:53,916][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:51:54,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:51:54,940][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:51:55,444][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:51:55,948][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:51:56,452][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:51:56,959][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:51:57,463][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:51:57,964][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:51:58,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:51:58,971][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:51:59,472][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:51:59,976][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:52:00,477][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:52:00,978][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:52:01,481][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:52:01,983][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:52:02,485][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:52:02,987][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:52:03,489][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:52:03,991][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:52:04,491][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:52:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:52:05,494][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:52:05,997][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:52:06,507][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:52:07,007][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:52:07,513][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:52:08,014][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:52:08,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10839 tokens. [2025-11-12 23:52:09,201][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.37%, ΔTime: 00:00:32 [2025-11-12 23:52:09,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:52:09,977][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:52:09,979][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:52:10,883][__main__][INFO] - Iteration 116 took 50s (27.88% Gen, 70.32% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 11m 21s. Estimated total time: 41h 53m 50s. Time estimates for 10 more iterations: 8m 22s, 100 more iterations: 1h 23m 47s, 500 more iterations: 6h 58m 58s. [2025-11-12 23:52:10,885][__main__][INFO] - Starting iteration 116. [2025-11-12 23:52:11,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2025-11-12 23:52:11,399][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:52:14,878][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:52:27,788][__main__][INFO] - Number of regex retries in iteration 116: 1 [2025-11-12 23:52:27,788][__main__][INFO] - agents played in iteration 116 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:52:28,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:52:28,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:52:28,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:52:28,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:52:28,705][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:52:28,706][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:52:29,338][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:52:29,799][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:52:30,310][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:52:30,810][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:52:31,316][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:52:31,823][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:52:32,330][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:52:32,839][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:52:33,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:52:33,860][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:52:34,362][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:52:34,865][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:52:35,369][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:52:35,874][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:52:36,381][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:52:36,883][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:52:37,383][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:52:37,886][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:52:38,388][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:52:38,888][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:52:39,389][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:52:39,889][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:52:40,390][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:52:40,890][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:52:41,390][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:52:41,889][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:52:42,392][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:52:42,918][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:52:43,423][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:52:43,927][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:52:44,444][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:52:44,945][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:52:45,449][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:52:45,953][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:52:46,458][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:52:46,966][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:52:47,469][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:52:47,972][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:52:48,478][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:52:48,982][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:52:49,486][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:52:49,986][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:52:50,489][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:52:50,990][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:52:51,491][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:52:51,991][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:52:52,497][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:52:52,998][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:52:53,497][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:52:53,999][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:52:54,501][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:52:55,004][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:52:55,506][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:52:56,006][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:52:56,507][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:52:57,008][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:52:57,511][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:52:58,011][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:52:58,515][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:52:59,025][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:52:59,527][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:53:00,052][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:53:00,549][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:53:01,052][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:53:01,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10848 tokens. [2025-11-12 23:53:02,260][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:32 [2025-11-12 23:53:03,038][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:53:03,039][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:53:03,041][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:53:03,945][__main__][INFO] - Iteration 117 took 52s (31.19% Gen, 67.09% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 4m 3s. Estimated total time: 43h 47m 26s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 34s, 500 more iterations: 7h 17m 54s. [2025-11-12 23:53:03,947][__main__][INFO] - Starting iteration 117. [2025-11-12 23:53:04,428][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2025-11-12 23:53:04,429][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:53:07,560][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:53:19,852][__main__][INFO] - Number of regex retries in iteration 117: 1 [2025-11-12 23:53:19,852][__main__][INFO] - agents played in iteration 117 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:53:20,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:53:20,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:53:20,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:53:20,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:53:20,765][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:53:20,766][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:53:21,402][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:53:21,862][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:53:22,374][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:53:22,883][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:53:23,388][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:53:23,890][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:53:24,394][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:53:24,897][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:53:25,401][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:53:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:53:26,410][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:53:26,916][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:53:27,418][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:53:27,919][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:53:28,423][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:53:28,929][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:53:29,437][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:53:29,947][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:53:30,453][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:53:30,956][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:53:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:53:31,966][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:53:32,468][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:53:32,972][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:53:33,472][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:53:33,973][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:53:34,474][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:53:34,977][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:53:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:53:35,985][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:53:36,489][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:53:36,993][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:53:37,495][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:53:38,001][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:53:38,507][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:53:39,014][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:53:39,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:53:40,024][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:53:40,525][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:53:41,027][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:53:41,530][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:53:42,046][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:53:42,547][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:53:43,048][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:53:43,551][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:53:44,052][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:53:44,569][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:53:45,069][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:53:45,572][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:53:46,071][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:53:46,572][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:53:47,076][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:53:47,577][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:53:48,077][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:53:48,581][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:53:49,084][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:53:49,585][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:53:50,088][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:53:50,589][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:53:51,091][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:53:51,595][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:53:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:53:52,605][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:53:53,106][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:53:53,611][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10865 tokens. [2025-11-12 23:53:54,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:32 [2025-11-12 23:53:55,011][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:53:55,013][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:53:55,015][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:53:55,917][__main__][INFO] - Iteration 118 took 51s (29.95% Gen, 68.29% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 10m 14s. Estimated total time: 42h 54m 28s. Time estimates for 10 more iterations: 8m 34s, 100 more iterations: 1h 25m 48s, 500 more iterations: 7h 9m 4s. [2025-11-12 23:53:55,919][__main__][INFO] - Starting iteration 118. [2025-11-12 23:53:56,392][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2025-11-12 23:53:56,392][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:54:11,457][__main__][INFO] - Number of regex retries in iteration 118: 0 [2025-11-12 23:54:11,457][__main__][INFO] - agents played in iteration 118 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:54:12,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:54:12,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:54:12,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:54:12,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:54:12,322][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:54:12,324][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:54:12,966][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:54:13,434][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:54:13,943][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:54:14,446][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:54:14,949][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:54:15,458][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:54:15,975][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:54:16,482][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:54:16,986][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:54:17,488][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:54:17,990][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:54:18,494][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:54:19,001][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:54:19,505][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:54:20,008][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:54:20,510][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:54:21,013][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:54:21,519][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:54:22,024][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:54:22,528][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:54:23,029][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:54:23,530][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:54:24,044][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:54:24,546][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:54:25,064][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:54:25,569][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:54:26,076][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:54:26,578][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:54:27,084][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:54:27,587][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:54:28,092][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:54:28,595][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:54:29,102][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:54:29,606][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:54:30,108][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:54:30,608][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:54:31,110][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:54:31,612][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:54:32,110][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:54:32,612][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:54:33,112][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:54:33,613][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:54:34,116][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:54:34,619][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:54:35,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:54:35,637][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:54:36,140][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:54:36,647][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:54:37,149][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:54:37,650][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:54:38,167][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:54:38,668][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:54:39,171][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:54:39,672][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:54:40,172][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:54:40,685][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:54:41,187][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:54:41,709][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:54:42,214][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:54:42,718][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:54:43,226][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:54:43,731][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:54:44,232][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:54:44,734][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:54:45,234][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10860 tokens. [2025-11-12 23:54:45,876][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:32 [2025-11-12 23:54:46,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:54:46,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:54:46,630][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:54:47,565][__main__][INFO] - Iteration 119 took 51s (29.44% Gen, 68.73% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 53m 35s. Estimated total time: 42h 38m 41s. Time estimates for 10 more iterations: 8m 31s, 100 more iterations: 1h 25m 17s, 500 more iterations: 7h 6m 26s. [2025-11-12 23:54:47,567][__main__][INFO] - Starting iteration 119. [2025-11-12 23:54:48,047][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2025-11-12 23:54:48,047][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:55:03,143][__main__][INFO] - Number of regex retries in iteration 119: 0 [2025-11-12 23:55:03,144][__main__][INFO] - agents played in iteration 119 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:55:03,925][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:55:03,948][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:55:03,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:55:03,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:55:03,992][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:55:03,992][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:55:04,640][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:55:05,099][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:55:05,608][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:55:06,111][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:55:06,615][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:55:07,124][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:55:07,625][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:55:08,129][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:55:08,631][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:55:09,135][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:55:09,643][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:55:10,146][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:55:10,645][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:55:11,151][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:55:11,653][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:55:12,180][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:55:12,681][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:55:13,182][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:55:13,686][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:55:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:55:14,694][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:55:15,194][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:55:15,694][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:55:16,202][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:55:16,706][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:55:17,210][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:55:17,713][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:55:18,216][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:55:18,721][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:55:19,222][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:55:19,725][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:55:20,231][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:55:20,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:55:21,245][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:55:21,751][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:55:22,258][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:55:22,783][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:55:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:55:23,793][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:55:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:55:24,801][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:55:25,304][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:55:25,805][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:55:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:55:26,813][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:55:27,315][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:55:27,821][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:55:28,328][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:55:28,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:55:29,335][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:55:29,837][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:55:30,338][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:55:30,846][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:55:31,351][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:55:31,853][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:55:32,368][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:55:32,871][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:55:33,383][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:55:33,884][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:55:34,387][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:55:34,902][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:55:35,405][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:55:35,910][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:55:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:55:36,911][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10853 tokens. [2025-11-12 23:55:37,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:32 [2025-11-12 23:55:38,348][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:55:38,351][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:55:38,353][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:55:39,266][__main__][INFO] - Iteration 120 took 51s (29.48% Gen, 68.74% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 55m 2s. Estimated total time: 42h 41m 0s. Time estimates for 10 more iterations: 8m 32s, 100 more iterations: 1h 25m 22s, 500 more iterations: 7h 6m 50s. [2025-11-12 23:55:39,268][__main__][INFO] - Starting iteration 120. [2025-11-12 23:55:39,763][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 11 and human policies 1. 
[2025-11-12 23:55:39,764][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:55:55,780][__main__][INFO] - Number of regex retries in iteration 120: 0 [2025-11-12 23:55:55,781][__main__][INFO] - agents played in iteration 120 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:55:56,613][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:55:56,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:55:56,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:55:56,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:55:56,687][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:55:56,688][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:55:57,341][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:55:57,801][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:55:58,313][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:55:58,818][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:55:59,324][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:55:59,832][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:56:00,335][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:56:00,834][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:56:01,339][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:56:01,840][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:56:02,341][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:56:02,844][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:56:03,344][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:56:03,852][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:56:04,353][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:56:04,853][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:56:05,352][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:56:05,855][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:56:06,367][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:56:06,873][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:56:07,374][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:56:07,876][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:56:08,384][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:56:08,896][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:56:09,399][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:56:09,904][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:56:10,413][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:56:10,914][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:56:11,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:56:11,925][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:56:12,427][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:56:12,935][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:56:13,439][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:56:13,943][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:56:14,445][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:56:14,948][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:56:15,451][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:56:15,952][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:56:16,454][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:56:16,966][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:56:17,469][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:56:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:56:18,486][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:56:18,987][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:56:19,499][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:56:20,002][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:56:20,508][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:56:21,009][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:56:21,512][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:56:22,017][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:56:22,518][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:56:23,017][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:56:23,518][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:56:24,022][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:56:24,521][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:56:25,022][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:56:25,521][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:56:26,025][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:56:26,529][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:56:27,033][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:56:27,533][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:56:28,032][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:56:28,532][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:56:29,031][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:56:29,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10852 tokens. [2025-11-12 23:56:30,170][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:32 [2025-11-12 23:56:30,939][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:56:30,941][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:56:30,943][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:56:32,751][__main__][INFO] - Iteration 121 took 52s (30.23% Gen, 66.36% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 22m 35s. Estimated total time: 44h 9m 27s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 18s, 500 more iterations: 7h 21m 34s. [2025-11-12 23:56:32,753][__main__][INFO] - Starting iteration 121. [2025-11-12 23:56:33,245][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2025-11-12 23:56:33,246][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:56:47,287][__main__][INFO] - Number of regex retries in iteration 121: 0 [2025-11-12 23:56:47,288][__main__][INFO] - agents played in iteration 121 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:56:48,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:56:48,142][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:56:48,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:56:48,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:56:48,190][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:56:48,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:56:48,831][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:56:49,293][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:56:49,810][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:56:50,315][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:56:50,818][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:56:51,321][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:56:51,823][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:56:52,327][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:56:52,829][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:56:53,330][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:56:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:56:54,332][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:56:54,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:56:55,334][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:56:55,838][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:56:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:56:56,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:56:57,347][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:56:57,849][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:56:58,352][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:56:58,857][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:56:59,361][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:56:59,866][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:57:00,370][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:57:00,875][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:57:01,404][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:57:01,909][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:57:02,412][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:57:02,919][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:57:03,423][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:57:03,927][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:57:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:57:04,934][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:57:05,445][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:57:05,948][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:57:06,451][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:57:06,954][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:57:07,457][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:57:07,964][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:57:08,467][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:57:08,971][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:57:09,479][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:57:09,984][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:57:10,487][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:57:11,006][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:57:11,509][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:57:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:57:12,523][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:57:13,026][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:57:13,528][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:57:14,031][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:57:14,534][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:57:15,038][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:57:15,544][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:57:16,048][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:57:16,551][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:57:17,054][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:57:17,560][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:57:18,065][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:57:18,570][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:57:19,072][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:57:19,573][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:57:20,086][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:57:20,587][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:57:21,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10862 tokens. [2025-11-12 23:57:21,729][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:32 [2025-11-12 23:57:22,502][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:57:22,504][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:57:22,506][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:57:23,439][__main__][INFO] - Iteration 122 took 50s (27.98% Gen, 70.16% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 2m 1s. Estimated total time: 41h 49m 43s. Time estimates for 10 more iterations: 8m 21s, 100 more iterations: 1h 23m 39s, 500 more iterations: 6h 58m 17s. [2025-11-12 23:57:23,442][__main__][INFO] - Starting iteration 122. [2025-11-12 23:57:23,923][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2025-11-12 23:57:23,924][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:57:26,874][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:57:31,508][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:57:37,818][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:57:38,671][__main__][INFO] - Number of regex retries in iteration 122: 3 [2025-11-12 23:57:38,671][__main__][INFO] - agents played in iteration 122 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:57:39,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:57:39,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:57:39,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:57:39,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:57:39,548][mllm.training.trainer_ad_align][INFO] - 
Sharing advantage alignment data. [2025-11-12 23:57:39,549][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-12 23:57:40,225][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:57:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:57:41,208][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:57:41,713][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:57:42,216][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:57:42,722][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:57:43,224][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:57:43,726][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:57:44,229][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:57:44,730][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:57:45,232][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:57:45,732][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:57:46,235][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:57:46,737][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:57:47,238][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:57:47,742][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:57:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:57:48,753][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:57:49,258][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:57:49,761][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-12 23:57:50,263][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 23:57:50,769][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:57:51,275][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:57:51,795][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:57:52,305][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:57:52,817][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:57:53,319][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:57:53,824][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:57:54,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:57:54,834][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:57:55,341][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:57:55,846][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:57:56,351][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:57:56,854][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:57:57,356][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:57:57,858][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:57:58,369][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:57:58,870][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:57:59,368][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:57:59,877][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:58:00,376][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-12 23:58:00,901][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 23:58:01,403][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:58:01,905][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:58:02,409][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:58:02,909][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:58:03,411][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:58:03,912][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:58:04,414][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:58:04,925][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:58:05,426][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:58:05,929][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:58:06,432][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:58:06,933][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:58:07,437][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:58:07,938][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:58:08,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:58:08,946][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:58:09,451][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:58:09,952][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:58:10,452][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:58:10,952][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-12 23:58:11,453][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 23:58:11,953][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:58:12,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10851 tokens. [2025-11-12 23:58:13,144][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:32 [2025-11-12 23:58:13,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:58:13,910][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:58:13,912][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:58:14,854][__main__][INFO] - Iteration 123 took 50s (28.95% Gen, 69.19% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 38m 1s. Estimated total time: 42h 26m 35s. Time estimates for 10 more iterations: 8m 29s, 100 more iterations: 1h 24m 53s, 500 more iterations: 7h 4m 25s. [2025-11-12 23:58:14,856][__main__][INFO] - Starting iteration 123. [2025-11-12 23:58:15,375][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2025-11-12 23:58:15,376][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:58:18,802][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-12 23:58:30,602][__main__][INFO] - Number of regex retries in iteration 123: 1 [2025-11-12 23:58:30,602][__main__][INFO] - agents played in iteration 123 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:58:31,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:58:31,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:58:31,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:58:31,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:58:31,479][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:58:31,480][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:58:32,103][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:58:32,562][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:58:33,069][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:58:33,575][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:58:34,090][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:58:34,595][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:58:35,099][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:58:35,605][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:58:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:58:36,623][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:58:37,126][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:58:37,627][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:58:38,133][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:58:38,635][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:58:39,142][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:58:39,644][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:58:40,148][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:58:40,651][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:58:41,156][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:58:41,660][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:58:42,165][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:58:42,670][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:58:43,177][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:58:43,681][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:58:44,187][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:58:44,706][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:58:45,211][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:58:45,722][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:58:46,229][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:58:46,733][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:58:47,244][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:58:47,750][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:58:48,263][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:58:48,767][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:58:49,270][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:58:49,785][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:58:50,287][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:58:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:58:51,297][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:58:51,800][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:58:52,306][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:58:52,806][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:58:53,309][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:58:53,813][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:58:54,314][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:58:54,827][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:58:55,329][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:58:55,833][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:58:56,338][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:58:56,842][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:58:57,346][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:58:57,849][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:58:58,353][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:58:58,860][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:58:59,366][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:58:59,866][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:59:00,368][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:59:00,869][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:59:01,372][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:59:01,874][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:59:02,378][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:59:02,891][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:59:03,395][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:59:03,906][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:59:04,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10868 tokens. [2025-11-12 23:59:05,122][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.52%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-12 23:59:05,878][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:59:05,880][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:59:05,882][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:59:06,812][__main__][INFO] - Iteration 124 took 51s (29.60% Gen, 68.59% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 2m 26s. Estimated total time: 42h 51m 51s. Time estimates for 10 more iterations: 8m 34s, 100 more iterations: 1h 25m 43s, 500 more iterations: 7h 8m 38s. [2025-11-12 23:59:06,815][__main__][INFO] - Starting iteration 124. [2025-11-12 23:59:07,290][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2025-11-12 23:59:07,290][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-12 23:59:22,086][__main__][INFO] - Number of regex retries in iteration 124: 0 [2025-11-12 23:59:22,087][__main__][INFO] - agents played in iteration 124 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-12 23:59:22,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:59:22,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:59:22,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:59:22,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-12 23:59:22,999][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-12 23:59:23,000][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-12 23:59:23,636][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-12 23:59:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-12 23:59:24,607][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-12 23:59:25,108][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-12 23:59:25,607][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-12 23:59:26,111][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-12 23:59:26,612][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-12 23:59:27,110][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-12 23:59:27,613][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-12 23:59:28,118][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-12 23:59:28,627][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-12 23:59:29,131][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-12 23:59:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-12 23:59:30,139][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-12 23:59:30,641][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-12 23:59:31,145][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-12 23:59:31,651][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-12 23:59:32,156][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-12 23:59:32,664][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-12 23:59:33,173][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-12 23:59:33,681][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-12 
23:59:34,189][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-12 23:59:34,696][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-12 23:59:35,217][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-12 23:59:35,724][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-12 23:59:36,232][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-12 23:59:36,741][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-12 23:59:37,248][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-12 23:59:37,756][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-12 23:59:38,262][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-12 23:59:38,766][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-12 23:59:39,270][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-12 23:59:39,772][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-12 23:59:40,273][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-12 23:59:40,777][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-12 23:59:41,280][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-12 23:59:41,784][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-12 23:59:42,286][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-12 23:59:42,789][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-12 23:59:43,301][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-12 23:59:43,805][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-12 23:59:44,313][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-12 
23:59:44,813][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-12 23:59:45,314][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-12 23:59:45,829][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-12 23:59:46,331][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-12 23:59:46,832][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-12 23:59:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-12 23:59:47,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-12 23:59:48,338][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-12 23:59:48,845][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-12 23:59:49,350][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-12 23:59:49,856][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-12 23:59:50,359][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-12 23:59:50,860][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-12 23:59:51,361][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-12 23:59:51,861][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-12 23:59:52,359][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-12 23:59:52,858][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-12 23:59:53,363][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-12 23:59:53,866][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-12 23:59:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-12 23:59:54,866][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-12 
23:59:55,371][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-12 23:59:55,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10850 tokens. [2025-11-12 23:59:56,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:32 [2025-11-12 23:59:57,325][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-12 23:59:57,327][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-12 23:59:57,329][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-12 23:59:58,229][__main__][INFO] - Iteration 125 took 50s (29.05% Gen, 69.18% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 36m 42s. Estimated total time: 42h 26m 59s. Time estimates for 10 more iterations: 8m 29s, 100 more iterations: 1h 24m 53s, 500 more iterations: 7h 4m 29s. [2025-11-12 23:59:58,231][__main__][INFO] - Starting iteration 125. [2025-11-12 23:59:58,704][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2025-11-12 23:59:58,705][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:00:13,490][__main__][INFO] - Number of regex retries in iteration 125: 0 [2025-11-13 00:00:13,490][__main__][INFO] - agents played in iteration 125 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:00:14,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:00:14,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:00:14,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:00:14,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:00:14,411][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:00:14,412][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:00:15,064][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:00:15,524][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:00:16,030][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:00:16,541][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:00:17,046][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:00:17,548][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:00:18,051][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:00:18,555][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:00:19,058][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:00:19,564][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:00:20,070][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:00:20,575][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:00:21,080][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:00:21,587][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:00:22,094][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:00:22,602][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:00:23,133][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:00:23,639][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:00:24,146][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:00:24,654][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:00:25,159][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:00:25,669][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:00:26,174][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:00:26,678][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:00:27,187][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:00:27,690][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:00:28,197][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:00:28,700][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:00:29,208][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:00:29,733][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:00:30,238][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:00:30,746][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:00:31,248][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:00:31,751][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:00:32,256][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:00:32,759][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:00:33,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:00:33,767][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:00:34,270][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:00:34,772][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:00:35,273][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:00:35,774][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:00:36,278][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:00:36,782][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:00:37,285][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:00:37,790][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:00:38,297][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:00:38,807][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:00:39,310][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:00:39,812][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:00:40,321][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:00:40,824][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:00:41,323][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:00:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:00:42,329][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:00:42,837][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:00:43,337][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:00:43,840][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:00:44,354][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:00:44,856][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:00:45,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:00:45,864][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:00:46,370][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:00:46,877][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:00:47,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10857 tokens. [2025-11-13 00:00:48,055][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.30%, Current % of VRAM taken: 58.55%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 00:00:48,801][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:00:48,803][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:00:48,804][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:00:49,651][__main__][INFO] - Iteration 126 took 50s (29.02% Gen, 69.31% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 36m 15s. Estimated total time: 42h 27m 23s. Time estimates for 10 more iterations: 8m 29s, 100 more iterations: 1h 24m 54s, 500 more iterations: 7h 4m 33s. [2025-11-13 00:00:49,653][__main__][INFO] - Starting iteration 126. [2025-11-13 00:00:50,148][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2025-11-13 00:00:50,148][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:00:52,951][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:00:57,447][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:01:03,957][__main__][INFO] - Number of regex retries in iteration 126: 2 [2025-11-13 00:01:03,958][__main__][INFO] - agents played in iteration 126 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:01:04,747][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:01:04,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:01:04,792][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:01:04,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:01:04,815][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:01:04,816][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:01:05,459][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:01:05,922][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:01:06,428][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:01:06,941][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:01:07,447][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:01:07,954][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:01:08,457][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:01:08,959][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:01:09,463][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:01:09,966][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:01:10,468][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:01:10,969][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:01:11,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:01:11,973][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:01:12,474][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:01:12,977][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:01:13,483][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:01:13,988][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:01:14,493][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:01:14,997][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:01:15,500][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:01:16,006][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:01:16,509][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:01:17,014][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:01:17,518][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:01:18,022][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:01:18,539][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:01:19,046][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:01:19,549][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:01:20,055][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:01:20,558][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:01:21,062][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:01:21,568][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:01:22,073][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:01:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:01:23,088][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:01:23,594][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:01:24,097][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:01:24,600][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:01:25,106][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:01:25,613][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:01:26,115][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:01:26,618][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:01:27,121][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:01:27,639][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:01:28,145][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:01:28,653][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:01:29,163][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:01:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:01:30,169][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:01:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:01:31,174][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:01:31,680][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:01:32,186][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:01:32,689][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:01:33,190][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:01:33,691][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:01:34,193][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:01:34,697][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:01:35,201][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:01:35,715][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:01:36,220][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:01:36,731][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:01:37,235][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:01:37,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10871 tokens. [2025-11-13 00:01:38,434][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:32 [2025-11-13 00:01:39,191][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:01:39,193][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:01:39,195][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:01:40,071][__main__][INFO] - Iteration 127 took 49s (27.66% Gen, 70.58% Train). Generation: 13s, Training: 35s. Estimated remaining time: 39h 44m 14s. Estimated total time: 41h 36m 12s. Time estimates for 10 more iterations: 8m 19s, 100 more iterations: 1h 23m 12s, 500 more iterations: 6h 56m 2s. [2025-11-13 00:01:40,074][__main__][INFO] - Starting iteration 127. [2025-11-13 00:01:40,570][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2025-11-13 00:01:40,570][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:01:45,136][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:01:54,992][__main__][INFO] - Number of regex retries in iteration 127: 1 [2025-11-13 00:01:54,993][__main__][INFO] - agents played in iteration 127 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:01:55,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:01:55,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:01:55,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:01:55,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:01:55,856][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:01:55,857][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:01:56,515][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:01:56,980][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:01:57,490][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:01:57,997][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:01:58,500][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:01:59,007][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:01:59,509][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:02:00,012][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:02:00,513][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:02:01,014][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:02:01,515][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:02:02,018][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:02:02,521][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:02:03,023][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:02:03,525][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:02:04,027][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:02:04,543][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:02:05,048][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:02:05,552][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:02:06,057][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:02:06,562][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:02:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:02:07,573][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:02:08,080][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:02:08,587][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:02:09,092][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:02:09,597][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:02:10,102][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:02:10,608][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:02:11,113][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:02:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:02:12,118][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:02:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:02:13,139][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:02:13,656][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:02:14,164][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:02:14,667][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:02:15,171][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:02:15,672][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:02:16,174][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:02:16,676][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:02:17,176][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:02:17,681][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:02:18,182][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:02:18,685][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:02:19,187][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:02:19,686][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:02:20,189][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:02:20,690][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:02:21,193][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:02:21,697][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:02:22,199][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:02:22,699][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:02:23,199][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:02:23,699][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:02:24,201][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:02:24,706][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:02:25,207][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:02:25,710][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:02:26,216][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:02:26,719][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:02:27,225][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:02:27,728][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:02:28,233][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:02:28,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10864 tokens. [2025-11-13 00:02:29,452][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:32 [2025-11-13 00:02:30,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:02:30,184][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:02:30,186][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:02:31,113][__main__][INFO] - Iteration 128 took 50s (28.53% Gen, 69.63% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 14m 23s. Estimated total time: 42h 7m 13s. Time estimates for 10 more iterations: 8m 25s, 100 more iterations: 1h 24m 14s, 500 more iterations: 7h 1m 12s. [2025-11-13 00:02:31,115][__main__][INFO] - Starting iteration 128. [2025-11-13 00:02:31,596][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2025-11-13 00:02:31,596][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:02:35,437][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:02:37,810][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:02:47,646][__main__][INFO] - Number of regex retries in iteration 128: 2 [2025-11-13 00:02:47,647][__main__][INFO] - agents played in iteration 128 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:02:48,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:02:48,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:02:48,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:02:48,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:02:48,569][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:02:48,569][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:02:49,232][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:02:49,693][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:02:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:02:50,704][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:02:51,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:02:51,707][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:02:52,209][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:02:52,710][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:02:53,214][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:02:53,718][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:02:54,221][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:02:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:02:55,248][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:02:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:02:56,264][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:02:56,768][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:02:57,274][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:02:57,784][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:02:58,289][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:02:58,793][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:02:59,300][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:02:59,811][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:03:00,319][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:03:00,828][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:03:01,339][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:03:01,843][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:03:02,358][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:03:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:03:03,370][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:03:03,877][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:03:04,383][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:03:04,887][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:03:05,391][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:03:05,928][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:03:06,452][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:03:06,958][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:03:07,465][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:03:07,969][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:03:08,473][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:03:08,979][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:03:09,488][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:03:09,994][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:03:10,498][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:03:11,001][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:03:11,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:03:12,005][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:03:12,509][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:03:13,015][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:03:13,518][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:03:14,022][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:03:14,525][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:03:15,025][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:03:15,530][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:03:16,032][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:03:16,535][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:03:17,037][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:03:17,539][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:03:18,063][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:03:18,571][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:03:19,074][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:03:19,579][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:03:20,083][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:03:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:03:21,093][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:03:21,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10864 tokens. [2025-11-13 00:03:22,268][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.42%, ΔTime: 00:00:33 [2025-11-13 00:03:23,004][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:03:23,006][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:03:23,007][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:03:23,925][__main__][INFO] - Iteration 129 took 52s (30.67% Gen, 67.57% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 42m 47s. Estimated total time: 43h 36m 30s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 13s, 500 more iterations: 7h 16m 5s. [2025-11-13 00:03:23,928][__main__][INFO] - Starting iteration 129. [2025-11-13 00:03:24,402][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2025-11-13 00:03:24,403][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:03:29,211][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:03:38,406][__main__][INFO] - Number of regex retries in iteration 129: 1 [2025-11-13 00:03:38,406][__main__][INFO] - agents played in iteration 129 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:03:39,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:03:39,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:03:39,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:03:39,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:03:39,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:03:39,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:03:40,006][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:03:40,470][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:03:40,977][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:03:41,485][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:03:41,987][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:03:42,490][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:03:42,997][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:03:43,499][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:03:44,003][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:03:44,509][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:03:45,015][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:03:45,520][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:03:46,025][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:03:46,530][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:03:47,051][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:03:47,559][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:03:48,068][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:03:48,573][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:03:49,076][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:03:49,585][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:03:50,089][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:03:50,592][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:03:51,096][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:03:51,600][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:03:52,105][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:03:52,609][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:03:53,114][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:03:53,618][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:03:54,119][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:03:54,623][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:03:55,126][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:03:55,629][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:03:56,147][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:03:56,651][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:03:57,155][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:03:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:03:58,163][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:03:58,674][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:03:59,175][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:03:59,679][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:04:00,187][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:04:00,690][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:04:01,195][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:04:01,696][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:04:02,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:04:02,704][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:04:03,204][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:04:03,708][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:04:04,212][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:04:04,713][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:04:05,216][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:04:05,722][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:04:06,225][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:04:06,727][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:04:07,227][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:04:07,728][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:04:08,228][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:04:08,733][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:04:09,247][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:04:09,750][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:04:10,253][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:04:10,755][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:04:11,259][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:04:11,769][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:04:12,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10823 tokens. [2025-11-13 00:04:12,974][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:32 [2025-11-13 00:04:13,726][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:04:13,727][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:04:13,730][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:04:14,611][__main__][INFO] - Iteration 130 took 50s (27.89% Gen, 70.35% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 55m 55s. Estimated total time: 41h 50m 28s. Time estimates for 10 more iterations: 8m 22s, 100 more iterations: 1h 23m 40s, 500 more iterations: 6h 58m 24s. [2025-11-13 00:04:14,614][__main__][INFO] - Starting iteration 130. [2025-11-13 00:04:15,099][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 12 and human policies 1. 
[2025-11-13 00:04:15,100][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:04:21,879][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:04:22,902][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:04:30,126][__main__][INFO] - Number of regex retries in iteration 130: 2 [2025-11-13 00:04:30,127][__main__][INFO] - agents played in iteration 130 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:04:30,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:04:30,991][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:04:31,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:04:31,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:04:31,039][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:04:31,040][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:04:31,650][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:04:32,107][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:04:32,617][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:04:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:04:33,627][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:04:34,131][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:04:34,634][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:04:35,136][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:04:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:04:36,152][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:04:36,660][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:04:37,164][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:04:37,665][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:04:38,170][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:04:38,679][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:04:39,188][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:04:39,694][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:04:40,203][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:04:40,707][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:04:41,215][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:04:41,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:04:42,225][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:04:42,729][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:04:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:04:43,740][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:04:44,246][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:04:44,753][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:04:45,255][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:04:45,760][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:04:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:04:46,773][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:04:47,287][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:04:47,791][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:04:48,299][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:04:48,809][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:04:49,315][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:04:49,820][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:04:50,326][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:04:50,834][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:04:51,340][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:04:51,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:04:52,350][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:04:52,853][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:04:53,364][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:04:53,887][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:04:54,393][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:04:54,895][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:04:55,398][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:04:55,905][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:04:56,410][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:04:56,915][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:04:57,419][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:04:57,924][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:04:58,428][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:04:58,930][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:04:59,432][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:04:59,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:05:00,435][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:05:00,941][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:05:01,447][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:05:01,962][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:05:02,466][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:05:02,979][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:05:03,482][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:05:03,986][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10850 tokens. [2025-11-13 00:05:04,665][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 00:05:05,410][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:05:05,412][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:05:05,413][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:05:07,326][__main__][INFO] - Iteration 131 took 52s (28.77% Gen, 67.56% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 35m 55s. Estimated total time: 43h 31m 21s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 13s. [2025-11-13 00:05:07,328][__main__][INFO] - Starting iteration 131. [2025-11-13 00:05:07,812][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 00:05:07,813][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:05:13,621][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:05:24,772][__main__][INFO] - Number of regex retries in iteration 131: 1 [2025-11-13 00:05:24,773][__main__][INFO] - agents played in iteration 131 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:05:25,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:05:25,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:05:25,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:05:25,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:05:25,673][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:05:25,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:05:26,323][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:05:26,791][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:05:27,298][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:05:27,804][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:05:28,309][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:05:28,821][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:05:29,328][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:05:29,834][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:05:30,341][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:05:30,870][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:05:31,377][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:05:31,880][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:05:32,385][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:05:32,892][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:05:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:05:33,905][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:05:34,410][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:05:34,914][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:05:35,418][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:05:35,922][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:05:36,426][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:05:36,933][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:05:37,442][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:05:37,946][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:05:38,455][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:05:38,961][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:05:39,469][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:05:39,974][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:05:40,477][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:05:40,985][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:05:41,490][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:05:41,993][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:05:42,499][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:05:43,002][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:05:43,509][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:05:44,016][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:05:44,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:05:45,024][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:05:45,526][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:05:46,030][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:05:46,536][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:05:47,041][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:05:47,544][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:05:48,046][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:05:48,548][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:05:49,052][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:05:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:05:50,058][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:05:50,577][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:05:51,081][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:05:51,591][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:05:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:05:52,600][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:05:53,107][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:05:53,609][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:05:54,113][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:05:54,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:05:55,120][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:05:55,625][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:05:56,127][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:05:56,631][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:05:57,132][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:05:57,633][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:05:58,133][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:05:58,635][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10850 tokens. [2025-11-13 00:05:59,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:32 [2025-11-13 00:06:00,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:06:00,047][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:06:00,048][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:06:00,972][__main__][INFO] - Iteration 132 took 53s (31.90% Gen, 66.36% Train). Generation: 16s, Training: 35s. Estimated remaining time: 42h 21m 41s. Estimated total time: 44h 18m 0s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 36s, 500 more iterations: 7h 23m 0s. [2025-11-13 00:06:00,974][__main__][INFO] - Starting iteration 132. [2025-11-13 00:06:01,461][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 00:06:01,462][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:06:05,567][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:06:07,926][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:06:17,231][__main__][INFO] - Number of regex retries in iteration 132: 2 [2025-11-13 00:06:17,231][__main__][INFO] - agents played in iteration 132 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:06:18,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:06:18,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:06:18,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:06:18,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:06:18,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:06:18,171][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:06:18,822][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:06:19,287][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:06:19,800][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:06:20,313][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:06:20,819][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:06:21,325][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:06:21,831][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:06:22,335][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:06:22,836][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:06:23,337][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:06:23,839][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:06:24,343][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:06:24,845][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:06:25,350][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:06:25,854][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:06:26,357][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:06:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:06:27,364][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:06:27,867][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:06:28,371][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:06:28,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:06:29,376][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:06:29,877][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:06:30,385][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:06:30,904][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:06:31,408][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:06:31,913][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:06:32,418][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:06:32,923][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:06:33,429][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:06:33,934][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:06:34,443][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:06:34,948][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:06:35,456][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:06:35,962][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:06:36,470][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:06:36,977][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:06:37,495][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:06:37,998][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:06:38,500][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:06:39,002][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:06:39,509][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:06:40,014][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:06:40,517][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:06:41,019][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:06:41,523][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:06:42,025][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:06:42,555][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:06:43,059][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:06:43,564][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:06:44,086][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:06:44,590][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:06:45,095][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:06:45,600][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:06:46,108][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:06:46,617][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:06:47,120][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:06:47,627][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:06:48,132][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:06:48,633][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:06:49,135][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:06:49,638][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:06:50,142][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:06:50,646][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:06:51,150][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10866 tokens. [2025-11-13 00:06:51,817][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.52%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 00:06:52,565][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:06:52,566][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:06:52,568][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:06:53,494][__main__][INFO] - Iteration 133 took 52s (30.31% Gen, 67.91% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 24m 29s. Estimated total time: 43h 21m 41s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 43s, 500 more iterations: 7h 13m 36s. [2025-11-13 00:06:53,497][__main__][INFO] - Starting iteration 133. [2025-11-13 00:06:53,990][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 00:06:53,990][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:06:59,864][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:07:02,718][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:07:10,306][__main__][INFO] - Number of regex retries in iteration 133: 2 [2025-11-13 00:07:10,307][__main__][INFO] - agents played in iteration 133 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:07:11,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:07:11,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:07:11,226][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:07:11,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:07:11,250][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:07:11,250][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:07:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:07:12,381][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:07:12,907][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:07:13,412][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:07:13,916][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:07:14,425][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:07:14,931][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:07:15,436][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:07:15,939][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:07:16,441][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:07:16,944][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:07:17,448][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:07:17,952][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:07:18,456][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:07:18,966][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:07:19,476][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:07:19,980][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:07:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:07:20,989][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:07:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:07:22,003][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:07:22,506][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:07:23,010][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:07:23,521][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:07:24,026][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:07:24,530][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:07:25,034][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:07:25,537][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:07:26,043][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:07:26,549][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:07:27,056][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:07:27,564][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:07:28,068][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:07:28,572][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:07:29,089][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:07:29,593][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:07:30,108][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:07:30,614][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:07:31,117][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:07:31,622][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:07:32,123][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:07:32,626][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:07:33,136][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:07:33,646][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:07:34,154][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:07:34,660][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:07:35,166][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:07:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:07:36,183][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:07:36,700][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:07:37,205][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:07:37,709][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:07:38,217][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:07:38,719][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:07:39,224][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:07:39,730][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:07:40,231][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:07:40,735][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:07:41,237][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:07:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:07:42,242][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:07:42,744][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:07:43,248][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:07:43,750][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:07:44,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10853 tokens. [2025-11-13 00:07:44,911][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 00:07:45,679][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:07:45,680][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:07:45,683][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:07:46,646][__main__][INFO] - Iteration 134 took 52s (30.99% Gen, 67.18% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 54m 46s. Estimated total time: 43h 52m 51s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 45s, 500 more iterations: 7h 18m 48s. [2025-11-13 00:07:46,649][__main__][INFO] - Starting iteration 134. [2025-11-13 00:07:47,144][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 00:07:47,144][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:08:01,254][__main__][INFO] - Number of regex retries in iteration 134: 0 [2025-11-13 00:08:01,254][__main__][INFO] - agents played in iteration 134 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:08:02,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:08:02,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:08:02,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:08:02,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:08:02,179][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:08:02,179][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:08:02,876][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:08:03,341][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:08:03,850][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:08:04,358][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:08:04,864][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:08:05,367][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:08:05,875][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:08:06,379][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:08:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:08:07,405][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:08:07,910][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:08:08,416][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:08:08,922][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:08:09,430][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:08:09,933][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:08:10,437][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:08:10,941][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:08:11,443][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:08:11,948][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:08:12,452][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:08:12,956][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:08:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:08:13,967][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:08:14,470][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:08:14,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:08:15,475][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:08:15,993][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:08:16,496][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:08:17,001][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:08:17,506][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:08:18,010][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:08:18,520][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:08:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:08:19,530][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:08:20,042][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:08:20,546][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:08:21,049][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:08:21,553][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:08:22,059][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:08:22,567][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:08:23,076][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:08:23,582][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:08:24,091][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:08:24,593][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:08:25,096][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:08:25,602][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:08:26,107][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:08:26,616][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:08:27,119][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:08:27,620][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:08:28,122][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:08:28,623][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:08:29,126][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:08:29,627][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:08:30,129][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:08:30,631][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:08:31,132][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:08:31,633][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:08:32,138][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:08:32,640][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:08:33,145][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:08:33,648][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:08:34,150][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:08:34,653][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:08:35,155][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10839 tokens. [2025-11-13 00:08:35,812][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:32 [2025-11-13 00:08:36,561][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:08:36,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:08:36,564][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:08:37,493][__main__][INFO] - Iteration 135 took 50s (28.02% Gen, 70.13% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 58m 31s. Estimated total time: 41h 57m 27s. Time estimates for 10 more iterations: 8m 23s, 100 more iterations: 1h 23m 54s, 500 more iterations: 6h 59m 34s. [2025-11-13 00:08:37,495][__main__][INFO] - Starting iteration 135. [2025-11-13 00:08:37,982][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 00:08:37,983][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:08:52,159][mllm.models.large_language_model_local][WARNING] - Response Given the pattern in the per-item values and the previous outcomes, it seems that both you and Bob have incentives to propose taking all items to maximize points, as long as the per-item values align in a way that makes this beneficial. However, in scenarios where per-item values are significantly different, it might be more strategic to propose taking items that align better with your higher value. In this round, you have a moderate value for hats and books (1 and 10 respectively), while Bob has higher values for books and balls. Given that you already have a good balance, here's the proposal: Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:08:53,472][__main__][INFO] - Number of regex retries in iteration 135: 1 [2025-11-13 00:08:53,473][__main__][INFO] - agents played in iteration 135 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:08:54,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:08:54,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:08:54,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:08:54,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % 
of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:08:54,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:08:54,334][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 00:08:55,002][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:08:55,468][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:08:56,075][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:08:56,585][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:08:57,090][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:08:57,600][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:08:58,105][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:08:58,609][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:08:59,114][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:08:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:09:00,123][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:09:00,629][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:09:01,132][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:09:01,634][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:09:02,137][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:09:02,641][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:09:03,146][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:09:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 
00:09:04,148][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:09:04,651][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:09:05,153][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 00:09:05,655][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:09:06,157][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:09:06,662][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:09:07,164][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:09:07,666][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:09:08,169][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:09:08,673][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:09:09,177][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:09:09,690][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:09:10,191][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:09:10,693][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:09:11,199][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:09:11,701][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:09:12,221][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:09:12,726][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:09:13,232][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:09:13,733][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:09:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 
00:09:14,737][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:09:15,244][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:09:15,750][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 00:09:16,254][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:09:16,757][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:09:17,260][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:09:17,762][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:09:18,260][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:09:18,764][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:09:19,263][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:09:19,768][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:09:20,279][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:09:20,782][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:09:21,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:09:21,796][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:09:22,300][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:09:22,807][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:09:23,309][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:09:23,818][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:09:24,319][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:09:24,820][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 
00:09:25,330][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:09:25,831][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:09:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 00:09:26,838][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:09:27,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10847 tokens. [2025-11-13 00:09:28,020][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 00:09:28,798][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:09:28,800][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:09:28,802][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:09:29,722][__main__][INFO] - Iteration 136 took 51s (29.94% Gen, 68.28% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 7m 11s. Estimated total time: 43h 6m 59s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 13s, 500 more iterations: 7h 11m 9s. [2025-11-13 00:09:29,724][__main__][INFO] - Starting iteration 136. [2025-11-13 00:09:30,210][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 00:09:30,211][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:09:34,154][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:09:46,165][__main__][INFO] - Number of regex retries in iteration 136: 1 [2025-11-13 00:09:46,165][__main__][INFO] - agents played in iteration 136 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:09:46,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:09:46,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:09:47,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:09:47,030][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:09:47,031][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:09:47,031][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:09:47,670][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:09:48,127][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:09:48,639][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:09:49,142][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:09:49,647][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:09:50,156][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:09:50,657][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:09:51,159][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:09:51,664][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:09:52,172][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:09:52,678][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:09:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:09:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:09:54,214][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:09:54,717][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:09:55,237][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:09:55,741][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:09:56,247][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:09:56,759][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:09:57,271][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:09:57,785][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:09:58,298][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:09:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:09:59,336][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:09:59,840][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:10:00,346][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:10:00,848][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:10:01,351][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:10:01,858][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:10:02,364][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:10:02,868][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:10:03,375][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:10:03,880][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:10:04,391][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:10:04,895][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:10:05,399][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:10:05,916][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:10:06,421][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:10:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:10:07,442][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:10:07,946][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:10:08,450][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:10:08,952][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:10:09,453][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:10:09,954][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:10:10,456][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:10:10,958][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:10:11,460][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:10:11,962][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:10:12,466][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:10:12,967][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:10:13,470][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:10:13,973][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:10:14,472][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:10:14,973][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:10:15,475][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:10:15,977][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:10:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:10:16,987][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:10:17,491][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:10:17,996][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:10:18,501][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:10:19,003][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:10:19,510][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:10:20,014][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10837 tokens. [2025-11-13 00:10:20,727][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.30%, Current % of VRAM taken: 58.54%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 00:10:21,483][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:10:21,485][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:10:21,487][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:10:22,390][__main__][INFO] - Iteration 137 took 52s (30.58% Gen, 67.69% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 28m 20s. Estimated total time: 43h 29m 0s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 58s, 500 more iterations: 7h 14m 50s. [2025-11-13 00:10:22,392][__main__][INFO] - Starting iteration 137. [2025-11-13 00:10:22,876][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 00:10:22,877][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:10:26,548][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:10:35,390][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:10:37,538][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:10:38,454][__main__][INFO] - Number of regex retries in iteration 137: 3 [2025-11-13 00:10:38,455][__main__][INFO] - agents played in iteration 137 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:10:39,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:10:39,304][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:10:39,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:10:39,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:10:39,353][mllm.training.trainer_ad_align][INFO] - 
Sharing advantage alignment data. [2025-11-13 00:10:39,354][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 00:10:40,035][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:10:40,498][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:10:41,008][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:10:41,515][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:10:42,016][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:10:42,522][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:10:43,030][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:10:43,534][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:10:44,055][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:10:44,564][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:10:45,069][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:10:45,572][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:10:46,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:10:46,579][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:10:47,084][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:10:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:10:48,092][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:10:48,592][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:10:49,092][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:10:49,593][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-13 00:10:50,098][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 00:10:50,602][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:10:51,106][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:10:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:10:52,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:10:52,616][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:10:53,116][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:10:53,617][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:10:54,119][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:10:54,622][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:10:55,124][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:10:55,628][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:10:56,132][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:10:56,638][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:10:57,149][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:10:57,652][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:10:58,156][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:10:58,666][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:10:59,170][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:10:59,671][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:11:00,171][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-13 00:11:00,678][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 00:11:01,181][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:11:01,683][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:11:02,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:11:02,708][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:11:03,213][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:11:03,720][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:11:04,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:11:04,731][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:11:05,232][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:11:05,734][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:11:06,234][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:11:06,737][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:11:07,238][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:11:07,757][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:11:08,256][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:11:08,763][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:11:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:11:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:11:10,270][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:11:10,771][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-13 00:11:11,274][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 00:11:11,779][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:11:12,280][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10827 tokens. [2025-11-13 00:11:12,924][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:32 [2025-11-13 00:11:13,670][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:11:13,671][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:11:13,673][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:11:14,814][__main__][INFO] - Iteration 138 took 51s (29.99% Gen, 67.81% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 15m 21s. Estimated total time: 43h 16m 54s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 33s, 500 more iterations: 7h 12m 49s. [2025-11-13 00:11:14,817][__main__][INFO] - Starting iteration 138. [2025-11-13 00:11:15,299][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 00:11:15,299][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:11:24,862][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:11:30,212][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:11:31,152][__main__][INFO] - Number of regex retries in iteration 138: 2 [2025-11-13 00:11:31,152][__main__][INFO] - agents played in iteration 138 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:11:32,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:11:32,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:11:32,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:11:32,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:11:32,082][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:11:32,083][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:11:32,763][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:11:33,225][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:11:33,735][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:11:34,252][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:11:34,751][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:11:35,279][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:11:35,782][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:11:36,287][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:11:36,792][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:11:37,299][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:11:37,810][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:11:38,317][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:11:38,823][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:11:39,329][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:11:39,832][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:11:40,337][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:11:40,846][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:11:41,358][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:11:41,882][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:11:42,388][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:11:42,895][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:11:43,429][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:11:43,940][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:11:44,474][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:11:44,984][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:11:45,493][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:11:45,999][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:11:46,502][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:11:47,006][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:11:47,510][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:11:48,016][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:11:48,519][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:11:49,024][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:11:49,529][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:11:50,033][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:11:50,539][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:11:51,042][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:11:51,548][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:11:52,051][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:11:52,558][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:11:53,063][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:11:53,586][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:11:54,091][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:11:54,595][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:11:55,098][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:11:55,601][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:11:56,108][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:11:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:11:57,115][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:11:57,619][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:11:58,117][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:11:58,620][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:11:59,126][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:11:59,629][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:12:00,134][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:12:00,635][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:12:01,138][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:12:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:12:02,145][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:12:02,667][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:12:03,169][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:12:03,672][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:12:04,174][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:12:04,676][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:12:05,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10821 tokens. [2025-11-13 00:12:05,823][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.51%, ΔTime: 00:00:33 [2025-11-13 00:12:06,587][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:12:06,589][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:12:06,591][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:12:07,523][__main__][INFO] - Iteration 139 took 52s (30.36% Gen, 67.86% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 28m 46s. Estimated total time: 43h 31m 12s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 12s. [2025-11-13 00:12:07,525][__main__][INFO] - Starting iteration 139. [2025-11-13 00:12:07,998][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 00:12:07,998][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:12:11,142][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:12:13,264][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:12:22,815][__main__][INFO] - Number of regex retries in iteration 139: 2 [2025-11-13 00:12:22,816][__main__][INFO] - agents played in iteration 139 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:12:23,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:12:23,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:12:23,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:12:23,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:12:23,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:12:23,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:12:24,357][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:12:24,816][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:12:25,326][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:12:25,833][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:12:26,338][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:12:26,845][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:12:27,349][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:12:27,861][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:12:28,367][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:12:28,871][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:12:29,374][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:12:29,878][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:12:30,390][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:12:30,894][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:12:31,400][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:12:31,905][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:12:32,408][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:12:32,911][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:12:33,415][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:12:33,916][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:12:34,422][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:12:34,925][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:12:35,430][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:12:35,931][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:12:36,433][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:12:36,947][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:12:37,451][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:12:37,966][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:12:38,471][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:12:38,975][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:12:39,480][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:12:39,983][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:12:40,489][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:12:40,995][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:12:41,500][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:12:42,008][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:12:42,511][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:12:43,014][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:12:43,518][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:12:44,021][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:12:44,527][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:12:45,031][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:12:45,533][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:12:46,039][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:12:46,544][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:12:47,046][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:12:47,548][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:12:48,046][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:12:48,564][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:12:49,067][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:12:49,568][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:12:50,069][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:12:50,571][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:12:51,074][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:12:51,581][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:12:52,081][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:12:52,589][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:12:53,090][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:12:53,591][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:12:54,091][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:12:54,591][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:12:55,095][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:12:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:12:56,095][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:12:56,599][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10841 tokens. [2025-11-13 00:12:57,234][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:32 [2025-11-13 00:12:58,002][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:12:58,003][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:12:58,004][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:12:58,938][__main__][INFO] - Iteration 140 took 50s (29.09% Gen, 69.08% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 23m 46s. Estimated total time: 42h 27m 3s. Time estimates for 10 more iterations: 8m 29s, 100 more iterations: 1h 24m 54s, 500 more iterations: 7h 4m 30s. [2025-11-13 00:12:58,940][__main__][INFO] - Starting iteration 140. [2025-11-13 00:12:59,435][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 13 and human policies 1. 
[2025-11-13 00:12:59,436][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:13:08,560][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:13:10,685][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 1 y book, 10 z balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:13:12,789][__main__][INFO] - Number of regex retries in iteration 140: 2 [2025-11-13 00:13:12,789][__main__][INFO] - agents played in iteration 140 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:13:13,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:13:13,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:13:13,629][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:13:13,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:13:13,652][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:13:13,653][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:13:14,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:13:14,734][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:13:15,243][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:13:15,747][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:13:16,253][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:13:16,757][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:13:17,259][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:13:17,763][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:13:18,266][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:13:18,792][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:13:19,296][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:13:19,802][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:13:20,309][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:13:20,815][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:13:21,320][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:13:21,829][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:13:22,334][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:13:22,842][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:13:23,348][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:13:23,856][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:13:24,361][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:13:24,865][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:13:25,387][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:13:25,891][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:13:26,401][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:13:26,908][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:13:27,412][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:13:27,917][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:13:28,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:13:28,926][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:13:29,434][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:13:29,939][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:13:30,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:13:30,949][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:13:31,452][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:13:31,964][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:13:32,468][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:13:32,971][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:13:33,477][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:13:33,981][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:13:34,493][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:13:34,996][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:13:35,501][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:13:36,007][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:13:36,511][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:13:37,014][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:13:37,519][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:13:38,024][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:13:38,535][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:13:39,036][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:13:39,537][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:13:40,041][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:13:40,547][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:13:41,057][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:13:41,559][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:13:42,061][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:13:42,574][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:13:43,078][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:13:43,590][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:13:44,091][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:13:44,592][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:13:45,093][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:13:45,593][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:13:46,096][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:13:46,597][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10875 tokens. [2025-11-13 00:13:47,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:32 [2025-11-13 00:13:48,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:13:48,014][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:13:48,016][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:13:49,854][__main__][INFO] - Iteration 141 took 50s (26.49% Gen, 69.87% Train). Generation: 13s, Training: 35s. Estimated remaining time: 39h 56m 48s. Estimated total time: 42h 0m 56s. Time estimates for 10 more iterations: 8m 24s, 100 more iterations: 1h 24m 1s, 500 more iterations: 7h 0m 9s. [2025-11-13 00:13:49,856][__main__][INFO] - Starting iteration 141. [2025-11-13 00:13:50,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 00:13:50,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:14:04,882][__main__][INFO] - Number of regex retries in iteration 141: 0 [2025-11-13 00:14:04,883][__main__][INFO] - agents played in iteration 141 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:14:05,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:14:05,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:14:05,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:14:05,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:14:05,742][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:14:05,743][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:14:06,368][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:14:06,827][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:14:07,337][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:14:07,840][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:14:08,344][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:14:08,849][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:14:09,354][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:14:09,857][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:14:10,360][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:14:10,865][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:14:11,369][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:14:11,873][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:14:12,378][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:14:12,883][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:14:13,386][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:14:13,906][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:14:14,410][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:14:14,922][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:14:15,427][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:14:15,933][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:14:16,438][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:14:16,944][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:14:17,451][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:14:17,956][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:14:18,459][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:14:18,968][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:14:19,483][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:14:19,989][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:14:20,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:14:21,015][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:14:21,520][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:14:22,027][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:14:22,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:14:23,038][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:14:23,537][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:14:24,040][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:14:24,546][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:14:25,049][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:14:25,552][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:14:26,056][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:14:26,560][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:14:27,076][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:14:27,577][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:14:28,080][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:14:28,585][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:14:29,089][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:14:29,606][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:14:30,110][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:14:30,618][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:14:31,124][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:14:31,630][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:14:32,136][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:14:32,645][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:14:33,151][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:14:33,659][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:14:34,161][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:14:34,663][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:14:35,164][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:14:35,662][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:14:36,171][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:14:36,671][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:14:37,170][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:14:37,682][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:14:38,180][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:14:38,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10869 tokens. [2025-11-13 00:14:39,339][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:32 [2025-11-13 00:14:40,098][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:14:40,100][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:14:40,101][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:14:41,042][__main__][INFO] - Iteration 142 took 50s (28.70% Gen, 69.45% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 10m 40s. Estimated total time: 42h 15m 40s. Time estimates for 10 more iterations: 8m 27s, 100 more iterations: 1h 24m 31s, 500 more iterations: 7h 2m 36s. [2025-11-13 00:14:41,044][__main__][INFO] - Starting iteration 142. [2025-11-13 00:14:41,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 00:14:41,517][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:14:53,241][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:14:54,551][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:14:55,402][__main__][INFO] - Number of regex retries in iteration 142: 2 [2025-11-13 00:14:55,403][__main__][INFO] - agents played in iteration 142 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:14:56,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:14:56,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:14:56,368][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:14:56,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:14:56,391][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:14:56,392][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:14:57,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:14:57,482][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:14:57,990][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:14:58,507][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:14:59,012][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:14:59,515][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:15:00,021][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:15:00,519][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:15:01,025][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:15:01,530][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:15:02,033][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:15:02,538][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:15:03,040][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:15:03,543][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:15:04,046][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:15:04,552][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:15:05,062][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:15:05,567][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:15:06,072][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:15:06,578][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:15:07,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:15:07,611][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:15:08,115][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:15:08,620][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:15:09,129][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:15:09,634][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:15:10,135][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:15:10,640][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:15:11,149][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:15:11,657][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:15:12,161][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:15:12,665][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:15:13,169][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:15:13,675][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:15:14,204][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:15:14,713][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:15:15,217][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:15:15,721][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:15:16,226][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:15:16,734][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:15:17,239][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:15:17,743][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:15:18,248][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:15:18,752][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:15:19,256][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:15:19,758][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:15:20,261][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:15:20,767][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:15:21,272][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:15:21,775][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:15:22,279][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:15:22,785][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:15:23,303][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:15:23,810][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:15:24,313][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:15:24,815][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:15:25,313][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:15:25,823][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:15:26,323][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:15:26,825][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:15:27,332][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:15:27,833][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:15:28,335][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:15:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:15:29,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10824 tokens. [2025-11-13 00:15:29,969][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:32 [2025-11-13 00:15:30,723][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:15:30,725][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:15:30,726][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:15:31,639][__main__][INFO] - Iteration 143 took 50s (27.70% Gen, 70.47% Train). Generation: 13s, Training: 35s. Estimated remaining time: 39h 40m 22s. Estimated total time: 41h 46m 12s. Time estimates for 10 more iterations: 8m 21s, 100 more iterations: 1h 23m 32s, 500 more iterations: 6h 57m 42s. [2025-11-13 00:15:31,641][__main__][INFO] - Starting iteration 143. [2025-11-13 00:15:32,133][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 00:15:32,133][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:15:37,072][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:15:39,170][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:15:47,369][__main__][INFO] - Number of regex retries in iteration 143: 2 [2025-11-13 00:15:47,369][__main__][INFO] - agents played in iteration 143 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:15:48,157][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:15:48,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:15:48,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:15:48,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:15:48,232][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:15:48,233][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:15:48,852][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:15:49,310][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:15:49,816][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:15:50,319][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:15:50,834][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:15:51,338][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:15:51,841][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:15:52,342][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:15:52,846][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:15:53,355][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:15:53,868][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:15:54,373][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:15:54,877][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:15:55,378][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:15:55,882][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:15:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:15:56,893][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:15:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:15:57,914][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:15:58,418][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:15:58,923][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:15:59,429][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:15:59,935][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:16:00,440][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:16:00,944][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:16:01,453][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:16:01,963][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:16:02,466][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:16:02,969][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:16:03,470][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:16:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:16:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:16:04,982][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:16:05,488][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:16:05,991][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:16:06,496][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:16:06,999][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:16:07,502][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:16:08,005][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:16:08,508][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:16:09,011][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:16:09,517][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:16:10,020][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:16:10,522][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:16:11,025][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:16:11,529][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:16:12,036][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:16:12,540][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:16:13,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:16:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:16:14,055][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:16:14,563][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:16:15,068][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:16:15,571][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:16:16,072][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:16:16,576][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:16:17,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:16:17,587][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:16:18,092][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:16:18,596][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:16:19,097][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:16:19,607][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:16:20,110][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:16:20,610][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:16:21,116][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10842 tokens. [2025-11-13 00:16:21,754][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.50%, ΔTime: 00:00:32 [2025-11-13 00:16:22,524][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:16:22,525][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:16:22,527][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:16:23,454][__main__][INFO] - Iteration 144 took 51s (29.69% Gen, 68.51% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 39m 22s. Estimated total time: 42h 46m 4s. Time estimates for 10 more iterations: 8m 33s, 100 more iterations: 1h 25m 32s, 500 more iterations: 7h 7m 40s. [2025-11-13 00:16:23,456][__main__][INFO] - Starting iteration 144. [2025-11-13 00:16:23,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 00:16:23,941][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:16:33,746][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:16:39,775][__main__][INFO] - Number of regex retries in iteration 144: 1 [2025-11-13 00:16:39,776][__main__][INFO] - agents played in iteration 144 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:16:40,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:16:40,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:16:40,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:16:40,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:16:40,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:16:40,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:16:41,301][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:16:41,765][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:16:42,272][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:16:42,773][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:16:43,278][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:16:43,786][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:16:44,292][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:16:44,797][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:16:45,300][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:16:45,805][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:16:46,308][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:16:46,811][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:16:47,334][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:16:47,838][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:16:48,347][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:16:48,856][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:16:49,362][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:16:49,872][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:16:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:16:50,887][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:16:51,396][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:16:51,900][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:16:52,409][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:16:52,913][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:16:53,418][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:16:53,923][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:16:54,428][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:16:54,934][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:16:55,438][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:16:55,940][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:16:56,450][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:16:56,952][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:16:57,457][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:16:57,961][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:16:58,466][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:16:58,973][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:16:59,477][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:16:59,980][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:17:00,485][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:17:00,986][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:17:01,489][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:17:01,993][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:17:02,495][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:17:03,003][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:17:03,508][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:17:04,010][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:17:04,512][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:17:05,014][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:17:05,536][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:17:06,041][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:17:06,547][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:17:07,050][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:17:07,552][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:17:08,061][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:17:08,566][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:17:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:17:09,577][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:17:10,077][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:17:10,577][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:17:11,079][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:17:11,580][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:17:12,080][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:17:12,577][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:17:13,080][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:17:13,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10872 tokens. [2025-11-13 00:17:14,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:32 [2025-11-13 00:17:14,976][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:17:14,978][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:17:14,979][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:17:15,892][__main__][INFO] - Iteration 145 took 51s (30.48% Gen, 67.76% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 10m 2s. Estimated total time: 43h 17m 36s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 35s, 500 more iterations: 7h 12m 56s. [2025-11-13 00:17:15,894][__main__][INFO] - Starting iteration 145. [2025-11-13 00:17:16,381][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 00:17:16,381][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:17:20,357][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:17:27,028][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:17:31,990][__main__][INFO] - Number of regex retries in iteration 145: 2 [2025-11-13 00:17:31,991][__main__][INFO] - agents played in iteration 145 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:17:32,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:17:32,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:17:32,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:17:32,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:17:32,923][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:17:32,923][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:17:33,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:17:34,060][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:17:34,577][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:17:35,082][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:17:35,585][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:17:36,089][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:17:36,591][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:17:37,095][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:17:37,598][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:17:38,100][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:17:38,609][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:17:39,115][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:17:39,619][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:17:40,122][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:17:40,628][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:17:41,145][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:17:41,651][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:17:42,158][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:17:42,663][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:17:43,168][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:17:43,678][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:17:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:17:44,687][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:17:45,192][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:17:45,695][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:17:46,208][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:17:46,712][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:17:47,214][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:17:47,720][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:17:48,224][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:17:48,729][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:17:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:17:49,736][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:17:50,253][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:17:50,758][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:17:51,262][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:17:51,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:17:52,270][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:17:52,771][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:17:53,276][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:17:53,779][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:17:54,286][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:17:54,793][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:17:55,295][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:17:55,799][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:17:56,302][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:17:56,814][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:17:57,317][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:17:57,822][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:17:58,332][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:17:58,838][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:17:59,339][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:17:59,842][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:18:00,344][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:18:00,856][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:18:01,355][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:18:01,855][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:18:02,358][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:18:02,861][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:18:03,367][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:18:03,868][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:18:04,369][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:18:04,869][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:18:05,369][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:18:05,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10846 tokens. [2025-11-13 00:18:06,507][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:32 [2025-11-13 00:18:07,287][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:18:07,288][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:18:07,290][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:18:08,215][__main__][INFO] - Iteration 146 took 51s (30.11% Gen, 68.10% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 3m 16s. Estimated total time: 43h 11m 43s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 23s, 500 more iterations: 7h 11m 57s. [2025-11-13 00:18:08,217][__main__][INFO] - Starting iteration 146. [2025-11-13 00:18:08,687][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 00:18:08,688][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:18:23,036][__main__][INFO] - Number of regex retries in iteration 146: 0 [2025-11-13 00:18:23,037][__main__][INFO] - agents played in iteration 146 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:18:23,838][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:18:23,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:18:23,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:18:23,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:18:23,907][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:18:23,908][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:18:24,584][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:18:25,046][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:18:25,554][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:18:26,056][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:18:26,557][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:18:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:18:27,561][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:18:28,063][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:18:28,569][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:18:29,071][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:18:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:18:30,072][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:18:30,575][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:18:31,082][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:18:31,584][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:18:32,087][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:18:32,590][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:18:33,093][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:18:33,599][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:18:34,102][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:18:34,604][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:18:35,112][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:18:35,616][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:18:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:18:36,627][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:18:37,131][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:18:37,653][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:18:38,160][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:18:38,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:18:39,168][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:18:39,673][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:18:40,179][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:18:40,680][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:18:41,184][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:18:41,688][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:18:42,191][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:18:42,694][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:18:43,195][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:18:43,696][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:18:44,201][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:18:44,706][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:18:45,209][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:18:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:18:46,215][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:18:46,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:18:47,228][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:18:47,732][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:18:48,257][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:18:48,762][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:18:49,268][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:18:49,772][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:18:50,273][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:18:50,776][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:18:51,278][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:18:51,780][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:18:52,285][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:18:52,787][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:18:53,296][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:18:53,800][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:18:54,305][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:18:54,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:18:55,315][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:18:55,818][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:18:56,336][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:18:56,843][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10796 tokens. [2025-11-13 00:18:57,547][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:32 [2025-11-13 00:18:58,318][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:18:58,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:18:58,321][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:18:59,266][__main__][INFO] - Iteration 147 took 50s (28.37% Gen, 69.76% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 59m 40s. Estimated total time: 42h 8m 58s. Time estimates for 10 more iterations: 8m 25s, 100 more iterations: 1h 24m 17s, 500 more iterations: 7h 1m 29s. [2025-11-13 00:18:59,268][__main__][INFO] - Starting iteration 147. [2025-11-13 00:18:59,777][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 00:18:59,778][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:19:03,245][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:19:13,931][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:19:14,922][__main__][INFO] - Number of regex retries in iteration 147: 2 [2025-11-13 00:19:14,923][__main__][INFO] - agents played in iteration 147 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:19:15,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:19:15,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:19:15,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:19:15,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:19:15,824][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:19:15,825][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:19:16,443][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:19:16,903][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:19:17,414][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:19:17,915][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:19:18,422][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:19:18,933][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:19:19,438][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:19:19,959][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:19:20,461][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:19:20,967][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:19:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:19:21,980][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:19:22,488][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:19:22,992][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:19:23,500][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:19:24,013][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:19:24,514][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:19:25,028][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:19:25,535][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:19:26,039][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:19:26,549][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:19:27,053][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:19:27,557][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:19:28,059][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:19:28,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:19:29,072][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:19:29,576][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:19:30,079][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:19:30,584][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:19:31,088][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:19:31,594][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:19:32,096][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:19:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:19:33,111][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:19:33,615][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:19:34,126][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:19:34,630][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:19:35,134][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:19:35,659][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:19:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:19:36,666][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:19:37,165][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:19:37,668][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:19:38,173][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:19:38,675][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:19:39,178][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:19:39,685][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:19:40,191][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:19:40,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:19:41,206][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:19:41,708][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:19:42,220][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:19:42,721][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:19:43,232][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:19:43,735][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:19:44,236][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:19:44,755][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:19:45,256][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:19:45,762][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:19:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:19:46,792][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:19:47,295][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:19:47,798][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:19:48,301][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:19:48,819][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10820 tokens. [2025-11-13 00:19:49,539][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.29%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 00:19:50,353][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:19:50,354][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:19:50,356][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:19:51,266][__main__][INFO] - Iteration 148 took 51s (29.41% Gen, 68.82% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 44m 17s. Estimated total time: 42h 54m 27s. Time estimates for 10 more iterations: 8m 34s, 100 more iterations: 1h 25m 48s, 500 more iterations: 7h 9m 4s. [2025-11-13 00:19:51,268][__main__][INFO] - Starting iteration 148. [2025-11-13 00:19:51,759][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 00:19:51,759][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:20:07,322][__main__][INFO] - Number of regex retries in iteration 148: 0 [2025-11-13 00:20:07,323][__main__][INFO] - agents played in iteration 148 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:20:08,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:20:08,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:20:08,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:20:08,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:20:08,237][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:20:08,238][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:20:08,857][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:20:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:20:09,842][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:20:10,345][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:20:10,848][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:20:11,356][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:20:11,859][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:20:12,366][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:20:12,869][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:20:13,371][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:20:13,876][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:20:14,378][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:20:14,881][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:20:15,386][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:20:15,890][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:20:16,395][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:20:16,898][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:20:17,404][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:20:17,910][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:20:18,414][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:20:18,919][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:20:19,423][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:20:19,927][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:20:20,431][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:20:20,933][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:20:21,435][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:20:21,942][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:20:22,449][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:20:22,969][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:20:23,475][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:20:23,981][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:20:24,481][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:20:24,987][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:20:25,492][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:20:25,996][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:20:26,500][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:20:27,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:20:27,508][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:20:28,012][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:20:28,518][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:20:29,025][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:20:29,554][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:20:30,059][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:20:30,566][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:20:31,074][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:20:31,581][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:20:32,091][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:20:32,597][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:20:33,103][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:20:33,612][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:20:34,114][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:20:34,622][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:20:35,128][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:20:35,632][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:20:36,153][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:20:36,657][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:20:37,160][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:20:37,662][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:20:38,166][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:20:38,673][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:20:39,176][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:20:39,678][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:20:40,188][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:20:40,692][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:20:41,196][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10832 tokens. [2025-11-13 00:20:41,941][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.50%, ΔTime: 00:00:33 [2025-11-13 00:20:42,759][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:20:42,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:20:42,763][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:20:43,691][__main__][INFO] - Iteration 149 took 51s (29.97% Gen, 68.24% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 5m 35s. Estimated total time: 43h 16m 37s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 33s, 500 more iterations: 7h 12m 46s. [2025-11-13 00:20:43,693][__main__][INFO] - Starting iteration 149. [2025-11-13 00:20:44,165][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 00:20:44,165][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:20:57,510][__main__][INFO] - Number of regex retries in iteration 149: 0 [2025-11-13 00:20:57,511][__main__][INFO] - agents played in iteration 149 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:20:58,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:20:58,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:20:58,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:20:58,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:20:58,368][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:20:58,369][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:20:58,991][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:20:59,451][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:20:59,967][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:21:00,475][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:21:00,991][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:21:01,492][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:21:01,994][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:21:02,497][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:21:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:21:03,501][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:21:04,009][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:21:04,519][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:21:05,024][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:21:05,532][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:21:06,037][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:21:06,549][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:21:07,056][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:21:07,569][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:21:08,074][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:21:08,575][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:21:09,090][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:21:09,595][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:21:10,100][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:21:10,606][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:21:11,110][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:21:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:21:12,116][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:21:12,618][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:21:13,118][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:21:13,622][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:21:14,121][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:21:14,629][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:21:15,133][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:21:15,644][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:21:16,151][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:21:16,655][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:21:17,160][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:21:17,666][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:21:18,188][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:21:18,693][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:21:19,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:21:19,701][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:21:20,203][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:21:20,711][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:21:21,213][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:21:21,715][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:21:22,218][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:21:22,718][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:21:23,222][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:21:23,725][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:21:24,228][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:21:24,735][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:21:25,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:21:25,744][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:21:26,259][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:21:26,761][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:21:27,271][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:21:27,772][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:21:28,274][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:21:28,777][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:21:29,278][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:21:29,778][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:21:30,281][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:21:30,783][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:21:31,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10853 tokens. [2025-11-13 00:21:31,940][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:32 [2025-11-13 00:21:32,702][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:21:32,704][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:21:32,706][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:21:33,611][__main__][INFO] - Iteration 150 took 49s (26.99% Gen, 71.18% Train). Generation: 13s, Training: 35s. Estimated remaining time: 39h 0m 28s. Estimated total time: 41h 12m 20s. Time estimates for 10 more iterations: 8m 14s, 100 more iterations: 1h 22m 24s, 500 more iterations: 6h 52m 3s. [2025-11-13 00:21:33,614][__main__][INFO] - Starting iteration 150. [2025-11-13 00:21:34,095][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 14 and human policies 1. 
[2025-11-13 00:21:34,095][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:21:48,942][__main__][INFO] - Number of regex retries in iteration 150: 0 [2025-11-13 00:21:48,942][__main__][INFO] - agents played in iteration 150 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:21:49,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:21:49,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:21:49,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:21:49,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:21:49,846][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:21:49,847][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:21:50,478][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:21:50,943][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:21:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:21:51,966][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:21:52,477][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:21:52,984][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:21:53,494][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:21:53,997][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:21:54,501][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:21:55,007][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:21:55,505][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:21:56,004][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:21:56,507][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:21:57,007][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:21:57,510][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:21:58,009][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:21:58,509][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:21:59,014][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:21:59,518][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:22:00,021][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:22:00,525][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:22:01,030][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:22:01,534][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:22:02,039][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:22:02,542][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:22:03,048][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:22:03,553][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:22:04,066][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:22:04,569][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:22:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:22:05,579][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:22:06,080][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:22:06,594][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:22:07,095][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:22:07,597][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:22:08,101][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:22:08,603][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:22:09,109][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:22:09,614][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:22:10,118][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:22:10,624][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:22:11,126][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:22:11,631][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:22:12,139][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:22:12,643][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:22:13,150][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:22:13,656][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:22:14,159][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:22:14,674][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:22:15,177][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:22:15,695][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:22:16,198][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:22:16,702][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:22:17,207][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:22:17,711][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:22:18,215][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:22:18,717][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:22:19,218][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:22:19,722][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:22:20,225][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:22:20,728][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:22:21,233][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:22:21,735][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:22:22,238][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:22:22,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10851 tokens. [2025-11-13 00:22:23,390][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.52%, ΔTime: 00:00:32 [2025-11-13 00:22:24,155][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:22:24,156][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:22:24,158][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:22:25,932][__main__][INFO] - Iteration 151 took 51s (28.64% Gen, 67.93% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 59m 11s. Estimated total time: 43h 11m 55s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 23s, 500 more iterations: 7h 11m 59s. [2025-11-13 00:22:25,934][__main__][INFO] - Starting iteration 151. [2025-11-13 00:22:26,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 00:22:26,407][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:22:41,897][__main__][INFO] - Number of regex retries in iteration 151: 0 [2025-11-13 00:22:41,897][__main__][INFO] - agents played in iteration 151 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:22:42,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:22:42,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:22:42,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:22:42,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:22:42,851][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:22:42,853][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:22:43,488][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:22:43,947][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:22:44,453][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:22:44,957][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:22:45,461][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:22:45,977][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:22:46,477][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:22:46,979][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:22:47,484][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:22:47,982][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:22:48,497][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:22:48,998][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:22:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:22:50,014][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:22:50,518][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:22:51,022][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:22:51,530][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:22:52,034][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:22:52,540][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:22:53,046][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:22:53,549][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:22:54,054][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:22:54,558][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:22:55,061][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:22:55,565][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:22:56,068][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:22:56,578][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:22:57,080][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:22:57,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:22:58,091][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:22:58,592][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:22:59,104][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:22:59,605][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:23:00,111][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:23:00,617][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:23:01,122][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:23:01,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:23:02,133][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:23:02,639][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:23:03,145][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:23:03,647][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:23:04,176][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:23:04,682][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:23:05,188][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:23:05,708][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:23:06,213][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:23:06,721][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:23:07,228][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:23:07,731][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:23:08,232][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:23:08,736][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:23:09,248][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:23:09,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:23:10,254][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:23:10,756][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:23:11,258][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:23:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:23:12,264][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:23:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:23:13,268][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:23:13,786][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:23:14,290][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:23:14,804][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:23:15,306][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:23:15,811][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10858 tokens. [2025-11-13 00:23:16,553][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 00:23:17,331][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:23:17,332][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:23:17,334][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:23:18,255][__main__][INFO] - Iteration 152 took 51s (29.88% Gen, 68.35% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 58m 50s. Estimated total time: 43h 12m 26s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 24s, 500 more iterations: 7h 12m 4s. [2025-11-13 00:23:18,258][__main__][INFO] - Starting iteration 152. [2025-11-13 00:23:18,736][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 00:23:18,737][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:23:30,681][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:23:33,172][__main__][INFO] - Number of regex retries in iteration 152: 1 [2025-11-13 00:23:33,173][__main__][INFO] - agents played in iteration 152 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:23:33,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:23:33,973][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:23:33,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:23:34,020][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:23:34,021][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:23:34,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:23:34,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:23:35,103][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:23:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:23:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:23:36,623][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:23:37,124][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:23:37,629][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:23:38,130][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:23:38,631][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:23:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:23:39,635][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:23:40,136][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:23:40,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:23:41,137][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:23:41,637][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:23:42,143][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:23:42,649][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:23:43,152][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:23:43,656][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:23:44,162][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:23:44,666][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:23:45,168][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:23:45,686][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:23:46,190][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:23:46,707][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:23:47,214][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:23:47,716][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:23:48,222][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:23:48,726][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:23:49,229][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:23:49,731][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:23:50,229][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:23:50,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:23:51,235][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:23:51,737][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:23:52,238][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:23:52,741][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:23:53,246][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:23:53,748][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:23:54,251][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:23:54,757][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:23:55,262][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:23:55,767][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:23:56,280][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:23:56,785][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:23:57,296][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:23:57,803][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:23:58,310][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:23:58,815][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:23:59,318][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:23:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:24:00,325][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:24:00,834][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:24:01,337][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:24:01,840][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:24:02,344][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:24:02,851][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:24:03,352][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:24:03,856][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:24:04,357][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:24:04,860][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:24:05,364][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:24:05,869][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:24:06,374][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:24:06,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10833 tokens. [2025-11-13 00:24:07,603][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:32 [2025-11-13 00:24:08,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:24:08,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:24:08,390][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:24:09,276][__main__][INFO] - Iteration 153 took 50s (28.56% Gen, 69.68% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 52m 34s. Estimated total time: 42h 7m 2s. Time estimates for 10 more iterations: 8m 25s, 100 more iterations: 1h 24m 14s, 500 more iterations: 7h 1m 10s. [2025-11-13 00:24:09,278][__main__][INFO] - Starting iteration 153. [2025-11-13 00:24:09,760][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 00:24:09,760][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:24:18,026][mllm.models.large_language_model_local][WARNING] - Response Proposals are as follows: Alice: 10 hats, 10 books, 10 balls Bob: 10 hats, 10 books, 10 balls In this round, both Alice and Bob have a similar strategy of proposing to take all of each item. Given the values, both Alice and Bob are willing to take all available items to maximize their points for this round. Since the item quantities match the total proposals, both Alice and Bob will receive the full amount they proposed, ensuring they each get the maximum possible points for this round. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:24:24,258][__main__][INFO] - Number of regex retries in iteration 153: 1 [2025-11-13 00:24:24,258][__main__][INFO] - agents played in iteration 153 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:24:25,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:24:25,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:24:25,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:24:25,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:24:25,263][mllm.training.trainer_ad_align][INFO] - 
Sharing advantage alignment data. [2025-11-13 00:24:25,264][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 00:24:25,870][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:24:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:24:26,840][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:24:27,342][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:24:27,848][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:24:28,351][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:24:28,854][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:24:29,360][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:24:29,862][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:24:30,366][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:24:30,869][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:24:31,377][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:24:31,879][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:24:32,382][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:24:32,895][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:24:33,399][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:24:33,905][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:24:34,409][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:24:34,914][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:24:35,424][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-13 00:24:35,926][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 00:24:36,431][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:24:36,937][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:24:37,442][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:24:37,948][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:24:38,451][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:24:38,954][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:24:39,458][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:24:39,960][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:24:40,462][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:24:40,961][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:24:41,463][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:24:41,967][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:24:42,470][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:24:42,973][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:24:43,486][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:24:43,989][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:24:44,495][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:24:44,996][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:24:45,497][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:24:46,013][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-13 00:24:46,515][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 00:24:47,016][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:24:47,519][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:24:48,025][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:24:48,534][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:24:49,038][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:24:49,545][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:24:50,051][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:24:50,555][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:24:51,058][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:24:51,562][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:24:52,067][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:24:52,574][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:24:53,077][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:24:53,582][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:24:54,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:24:54,595][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:24:55,095][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:24:55,596][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:24:56,093][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:24:56,598][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-13 00:24:57,098][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 00:24:57,602][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:24:58,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10855 tokens. [2025-11-13 00:24:58,810][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:32 [2025-11-13 00:24:59,593][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:24:59,595][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:24:59,596][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:25:00,532][__main__][INFO] - Iteration 154 took 50s (28.56% Gen, 69.60% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 3m 21s. Estimated total time: 42h 18m 40s. Time estimates for 10 more iterations: 8m 27s, 100 more iterations: 1h 24m 37s, 500 more iterations: 7h 3m 6s. [2025-11-13 00:25:00,534][__main__][INFO] - Starting iteration 154. [2025-11-13 00:25:01,000][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 00:25:01,000][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:25:16,069][__main__][INFO] - Number of regex retries in iteration 154: 0 [2025-11-13 00:25:16,070][__main__][INFO] - agents played in iteration 154 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:25:16,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:25:16,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:25:16,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:25:16,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:25:16,972][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:25:16,974][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:25:17,601][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:25:18,061][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:25:18,565][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:25:19,067][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:25:19,569][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:25:20,068][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:25:20,571][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:25:21,073][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:25:21,576][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:25:22,077][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:25:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:25:23,084][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:25:23,586][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:25:24,089][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:25:24,590][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:25:25,093][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:25:25,594][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:25:26,094][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:25:26,618][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:25:27,123][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:25:27,627][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:25:28,133][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:25:28,638][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:25:29,140][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:25:29,643][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:25:30,149][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:25:30,656][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:25:31,160][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:25:31,664][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:25:32,171][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:25:32,676][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:25:33,184][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:25:33,689][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:25:34,192][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:25:34,706][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:25:35,209][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:25:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:25:36,218][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:25:36,721][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:25:37,231][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:25:37,734][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:25:38,239][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:25:38,748][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:25:39,255][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:25:39,760][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:25:40,264][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:25:40,774][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:25:41,282][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:25:41,789][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:25:42,294][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:25:42,802][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:25:43,311][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:25:43,830][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:25:44,334][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:25:44,840][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:25:45,354][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:25:45,864][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:25:46,370][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:25:46,872][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:25:47,375][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:25:47,880][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:25:48,386][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:25:48,890][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:25:49,397][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:25:49,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10854 tokens. [2025-11-13 00:25:50,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 00:25:51,352][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:25:51,354][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:25:51,356][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:25:52,249][__main__][INFO] - Iteration 155 took 51s (29.40% Gen, 68.85% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 26m 20s. Estimated total time: 42h 42m 30s. Time estimates for 10 more iterations: 8m 32s, 100 more iterations: 1h 25m 25s, 500 more iterations: 7h 7m 5s. [2025-11-13 00:25:52,251][__main__][INFO] - Starting iteration 155. [2025-11-13 00:25:52,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 00:25:52,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:26:06,519][__main__][INFO] - Number of regex retries in iteration 155: 0 [2025-11-13 00:26:06,520][__main__][INFO] - agents played in iteration 155 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:26:07,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:26:07,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:26:07,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:26:07,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:26:07,378][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:26:07,378][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:26:07,993][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:26:08,455][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:26:08,960][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:26:09,461][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:26:09,963][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:26:10,464][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:26:10,967][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:26:11,470][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:26:11,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:26:12,478][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:26:12,982][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:26:13,483][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:26:13,985][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:26:14,496][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:26:14,999][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:26:15,515][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:26:16,017][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:26:16,541][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:26:17,043][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:26:17,550][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:26:18,049][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:26:18,551][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:26:19,059][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:26:19,563][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:26:20,068][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:26:20,574][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:26:21,076][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:26:21,581][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:26:22,096][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:26:22,601][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:26:23,120][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:26:23,626][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:26:24,142][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:26:24,650][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:26:25,156][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:26:25,663][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:26:26,168][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:26:26,674][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:26:27,198][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:26:27,709][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:26:28,239][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:26:28,745][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:26:29,252][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:26:29,756][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:26:30,259][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:26:30,763][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:26:31,267][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:26:31,771][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:26:32,274][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:26:32,779][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:26:33,282][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:26:33,788][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:26:34,292][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:26:34,799][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:26:35,306][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:26:35,811][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:26:36,320][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:26:36,828][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:26:37,335][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:26:37,841][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:26:38,348][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:26:38,859][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:26:39,367][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:26:39,873][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:26:40,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10839 tokens. [2025-11-13 00:26:41,061][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.29%, Current % of VRAM taken: 58.54%, Block Peak % of device VRAM: 62.51%, ΔTime: 00:00:33 [2025-11-13 00:26:41,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:26:41,842][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:26:41,843][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:26:42,775][__main__][INFO] - Iteration 156 took 50s (27.54% Gen, 70.59% Train). Generation: 13s, Training: 35s. Estimated remaining time: 39h 24m 52s. Estimated total time: 41h 41m 53s. Time estimates for 10 more iterations: 8m 20s, 100 more iterations: 1h 23m 23s, 500 more iterations: 6h 56m 58s. [2025-11-13 00:26:42,777][__main__][INFO] - Starting iteration 156. [2025-11-13 00:26:43,250][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 00:26:43,251][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:26:56,670][__main__][INFO] - Number of regex retries in iteration 156: 0 [2025-11-13 00:26:56,671][__main__][INFO] - agents played in iteration 156 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:26:57,630][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:26:57,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:26:57,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:26:57,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:26:57,707][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:26:57,708][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:26:58,382][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:26:58,853][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:26:59,363][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:26:59,864][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:27:00,370][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:27:00,873][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:27:01,377][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:27:01,879][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:27:02,380][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:27:02,885][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:27:03,388][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:27:03,890][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:27:04,395][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:27:04,897][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:27:05,399][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:27:05,904][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:27:06,407][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:27:06,923][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:27:07,426][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:27:07,940][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:27:08,443][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:27:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:27:09,459][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:27:09,963][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:27:10,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:27:10,976][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:27:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:27:11,986][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:27:12,489][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:27:12,992][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:27:13,497][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:27:14,000][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:27:14,507][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:27:15,011][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:27:15,513][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:27:16,016][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:27:16,516][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:27:17,017][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:27:17,529][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:27:18,031][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:27:18,542][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:27:19,045][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:27:19,547][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:27:20,052][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:27:20,555][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:27:21,057][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:27:21,559][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:27:22,061][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:27:22,566][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:27:23,069][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:27:23,577][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:27:24,080][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:27:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:27:25,087][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:27:25,592][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:27:26,096][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:27:26,601][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:27:27,106][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:27:27,614][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:27:28,119][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:27:28,624][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:27:29,143][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:27:29,646][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:27:30,151][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:27:30,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10869 tokens. [2025-11-13 00:27:31,371][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:32 [2025-11-13 00:27:32,145][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:27:32,146][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:27:32,148][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:27:33,044][__main__][INFO] - Iteration 157 took 49s (26.95% Gen, 71.25% Train). Generation: 13s, Training: 35s. Estimated remaining time: 39h 11m 51s. Estimated total time: 41h 29m 43s. Time estimates for 10 more iterations: 8m 17s, 100 more iterations: 1h 22m 59s, 500 more iterations: 6h 54m 57s. [2025-11-13 00:27:33,047][__main__][INFO] - Starting iteration 157. [2025-11-13 00:27:33,528][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 00:27:33,529][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:27:37,479][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:27:49,142][__main__][INFO] - Number of regex retries in iteration 157: 1 [2025-11-13 00:27:49,142][__main__][INFO] - agents played in iteration 157 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:27:49,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:27:50,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:27:50,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:27:50,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:27:50,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:27:50,062][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:27:50,667][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:27:51,126][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:27:51,637][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:27:52,138][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:27:52,648][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:27:53,149][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:27:53,652][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:27:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:27:54,665][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:27:55,168][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:27:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:27:56,172][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:27:56,676][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:27:57,177][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:27:57,678][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:27:58,180][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:27:58,680][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:27:59,186][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:27:59,688][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:28:00,188][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:28:00,694][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:28:01,197][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:28:01,701][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:28:02,207][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:28:02,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:28:03,219][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:28:03,726][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:28:04,232][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:28:04,752][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:28:05,258][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:28:05,768][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:28:06,275][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:28:06,781][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:28:07,282][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:28:07,787][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:28:08,290][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:28:08,792][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:28:09,293][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:28:09,798][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:28:10,300][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:28:10,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:28:11,309][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:28:11,837][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:28:12,342][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:28:12,848][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:28:13,355][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:28:13,867][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:28:14,372][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:28:14,878][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:28:15,379][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:28:15,883][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:28:16,387][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:28:16,892][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:28:17,396][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:28:17,903][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:28:18,408][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:28:18,911][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:28:19,415][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:28:19,921][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:28:20,429][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:28:20,936][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:28:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:28:21,952][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:28:22,455][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:28:22,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10865 tokens. [2025-11-13 00:28:23,698][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.52%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 00:28:24,465][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:28:24,466][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:28:24,468][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:28:25,337][__main__][INFO] - Iteration 158 took 51s (30.14% Gen, 68.18% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 51m 45s. Estimated total time: 43h 10m 29s. Time estimates for 10 more iterations: 8m 38s, 100 more iterations: 1h 26m 20s, 500 more iterations: 7h 11m 44s. [2025-11-13 00:28:25,339][__main__][INFO] - Starting iteration 158. [2025-11-13 00:28:25,819][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 00:28:25,820][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:28:40,452][__main__][INFO] - Number of regex retries in iteration 158: 0 [2025-11-13 00:28:40,453][__main__][INFO] - agents played in iteration 158 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:28:41,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:28:41,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:28:41,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:28:41,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:28:41,302][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:28:41,303][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:28:41,912][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:28:42,372][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:28:42,877][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:28:43,378][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:28:43,890][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:28:44,392][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:28:44,895][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:28:45,400][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:28:45,902][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:28:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:28:46,914][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:28:47,417][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:28:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:28:48,428][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:28:48,931][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:28:49,444][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:28:49,949][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:28:50,465][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:28:50,975][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:28:51,479][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:28:51,983][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:28:52,487][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:28:52,990][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:28:53,494][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:28:53,999][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:28:54,505][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:28:55,011][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:28:55,518][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:28:56,031][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:28:56,537][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:28:57,051][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:28:57,557][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:28:58,062][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:28:58,580][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:28:59,083][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:28:59,588][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:29:00,093][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:29:00,597][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:29:01,103][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:29:01,604][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:29:02,110][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:29:02,613][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:29:03,116][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:29:03,620][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:29:04,118][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:29:04,619][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:29:05,134][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:29:05,635][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:29:06,148][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:29:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:29:07,180][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:29:07,683][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:29:08,189][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:29:08,692][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:29:09,197][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:29:09,702][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:29:10,208][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:29:10,709][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:29:11,215][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:29:11,728][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:29:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:29:12,745][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:29:13,250][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:29:13,754][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:29:14,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10855 tokens. [2025-11-13 00:29:14,975][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 00:29:15,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:29:15,741][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:29:15,742][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:29:16,660][__main__][INFO] - Iteration 159 took 50s (28.78% Gen, 69.41% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 2m 29s. Estimated total time: 42h 22m 4s. Time estimates for 10 more iterations: 8m 28s, 100 more iterations: 1h 24m 44s, 500 more iterations: 7h 3m 40s. [2025-11-13 00:29:16,662][__main__][INFO] - Starting iteration 159. [2025-11-13 00:29:17,144][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 00:29:17,144][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:29:31,468][__main__][INFO] - Number of regex retries in iteration 159: 0 [2025-11-13 00:29:31,469][__main__][INFO] - agents played in iteration 159 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:29:32,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:29:32,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:29:32,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:29:32,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:29:32,424][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:29:32,425][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:29:33,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:29:33,552][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:29:34,074][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:29:34,577][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:29:35,081][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:29:35,580][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:29:36,082][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:29:36,584][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:29:37,086][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:29:37,587][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:29:38,090][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:29:38,592][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:29:39,093][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:29:39,595][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:29:40,096][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:29:40,601][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:29:41,107][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:29:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:29:42,111][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:29:42,612][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:29:43,115][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:29:43,619][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:29:44,126][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:29:44,630][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:29:45,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:29:45,639][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:29:46,143][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:29:46,648][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:29:47,174][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:29:47,678][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:29:48,183][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:29:48,686][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:29:49,192][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:29:49,698][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:29:50,203][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:29:50,710][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:29:51,219][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:29:51,723][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:29:52,229][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:29:52,731][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:29:53,233][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:29:53,750][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:29:54,250][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:29:54,753][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:29:55,256][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:29:55,758][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:29:56,271][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:29:56,773][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:29:57,275][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:29:57,779][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:29:58,281][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:29:58,785][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:29:59,288][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:29:59,791][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:30:00,296][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:30:00,796][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:30:01,296][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:30:01,798][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:30:02,298][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:30:02,799][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:30:03,299][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:30:03,800][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:30:04,305][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:30:04,811][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:30:05,316][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10845 tokens. [2025-11-13 00:30:06,070][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:32 [2025-11-13 00:30:06,847][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:30:06,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:30:06,850][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:30:07,715][__main__][INFO] - Iteration 160 took 50s (28.32% Gen, 69.96% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 48m 11s. Estimated total time: 42h 8m 37s. Time estimates for 10 more iterations: 8m 25s, 100 more iterations: 1h 24m 17s, 500 more iterations: 7h 1m 26s. [2025-11-13 00:30:07,718][__main__][INFO] - Starting iteration 160. [2025-11-13 00:30:08,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 15 and human policies 1. 
[2025-11-13 00:30:08,223][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:30:11,542][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:30:22,661][__main__][INFO] - Number of regex retries in iteration 160: 1 [2025-11-13 00:30:22,662][__main__][INFO] - agents played in iteration 160 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:30:23,502][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:30:23,529][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:30:23,555][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:30:23,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:30:23,579][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:30:23,580][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:30:24,238][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:30:24,697][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:30:25,207][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:30:25,710][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:30:26,213][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:30:26,721][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:30:27,228][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:30:27,733][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:30:28,234][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:30:28,736][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:30:29,239][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:30:29,742][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:30:30,245][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:30:30,759][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:30:31,261][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:30:31,772][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:30:32,274][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:30:32,775][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:30:33,280][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:30:33,781][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:30:34,283][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:30:34,787][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:30:35,288][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:30:35,795][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:30:36,296][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:30:36,798][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:30:37,302][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:30:37,810][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:30:38,317][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:30:38,819][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:30:39,323][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:30:39,830][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:30:40,334][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:30:40,838][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:30:41,351][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:30:41,857][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:30:42,364][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:30:42,869][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:30:43,374][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:30:43,879][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:30:44,385][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:30:44,894][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:30:45,398][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:30:45,902][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:30:46,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:30:46,910][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:30:47,414][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:30:47,917][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:30:48,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:30:48,924][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:30:49,435][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:30:49,940][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:30:50,454][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:30:50,957][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:30:51,466][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:30:51,971][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:30:52,474][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:30:52,999][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:30:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:30:54,022][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:30:54,529][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:30:55,036][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:30:55,542][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:30:56,048][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:30:56,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10840 tokens. [2025-11-13 00:30:57,285][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 00:30:58,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:30:58,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:30:58,058][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:30:59,781][__main__][INFO] - Iteration 161 took 51s (28.00% Gen, 68.65% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 36m 36s. Estimated total time: 42h 57m 54s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 55s, 500 more iterations: 7h 9m 39s. [2025-11-13 00:30:59,783][__main__][INFO] - Starting iteration 161. [2025-11-13 00:31:00,263][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2025-11-13 00:31:00,263][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:31:15,178][__main__][INFO] - Number of regex retries in iteration 161: 0 [2025-11-13 00:31:15,179][__main__][INFO] - agents played in iteration 161 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:31:15,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:31:15,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:31:16,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:31:16,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:31:16,027][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:31:16,028][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:31:16,659][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:31:17,119][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:31:17,629][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:31:18,133][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:31:18,636][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:31:19,141][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:31:19,647][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:31:20,154][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:31:20,662][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:31:21,164][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:31:21,668][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:31:22,169][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:31:22,670][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:31:23,189][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:31:23,689][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:31:24,190][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:31:24,691][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:31:25,192][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:31:25,693][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:31:26,193][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:31:26,693][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:31:27,204][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:31:27,707][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:31:28,207][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:31:28,709][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:31:29,211][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:31:29,716][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:31:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:31:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:31:31,226][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:31:31,729][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:31:32,234][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:31:32,737][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:31:33,241][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:31:33,749][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:31:34,257][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:31:34,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:31:35,272][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:31:35,777][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:31:36,284][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:31:36,792][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:31:37,295][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:31:37,797][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:31:38,299][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:31:38,808][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:31:39,312][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:31:39,814][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:31:40,314][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:31:40,816][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:31:41,321][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:31:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:31:42,325][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:31:42,828][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:31:43,329][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:31:43,832][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:31:44,334][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:31:44,835][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:31:45,338][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:31:45,841][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:31:46,346][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:31:46,850][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:31:47,354][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:31:47,860][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:31:48,368][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:31:48,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10865 tokens. [2025-11-13 00:31:49,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:32 [2025-11-13 00:31:50,335][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:31:50,337][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:31:50,339][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:31:51,207][__main__][INFO] - Iteration 162 took 50s (29.28% Gen, 69.02% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 5m 2s. Estimated total time: 42h 27m 12s. Time estimates for 10 more iterations: 8m 29s, 100 more iterations: 1h 24m 54s, 500 more iterations: 7h 4m 32s. [2025-11-13 00:31:51,208][__main__][INFO] - Starting iteration 162. [2025-11-13 00:31:51,691][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2025-11-13 00:31:51,692][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:32:02,816][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:32:03,013][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:32:05,291][__main__][INFO] - Number of regex retries in iteration 162: 2 [2025-11-13 00:32:05,291][__main__][INFO] - agents played in iteration 162 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:32:06,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:32:06,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:32:06,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:32:06,206][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:32:06,207][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:32:06,208][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:32:06,836][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:32:07,296][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:32:07,804][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:32:08,311][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:32:08,818][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:32:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:32:09,822][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:32:10,327][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:32:10,830][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:32:11,333][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:32:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:32:12,335][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:32:12,839][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:32:13,344][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:32:13,848][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:32:14,351][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:32:14,852][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:32:15,358][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:32:15,863][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:32:16,366][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:32:16,873][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:32:17,375][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:32:17,876][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:32:18,378][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:32:18,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:32:19,419][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:32:19,924][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:32:20,428][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:32:20,931][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:32:21,433][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:32:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:32:22,440][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:32:22,949][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:32:23,455][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:32:23,967][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:32:24,477][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:32:24,984][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:32:25,492][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:32:26,014][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:32:26,520][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:32:27,029][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:32:27,533][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:32:28,041][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:32:28,549][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:32:29,053][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:32:29,556][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:32:30,060][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:32:30,564][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:32:31,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:32:31,577][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:32:32,082][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:32:32,593][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:32:33,096][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:32:33,608][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:32:34,113][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:32:34,614][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:32:35,128][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:32:35,631][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:32:36,135][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:32:36,640][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:32:37,146][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:32:37,652][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:32:38,159][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:32:38,662][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:32:39,168][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10849 tokens. [2025-11-13 00:32:39,954][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.52%, ΔTime: 00:00:33 [2025-11-13 00:32:40,690][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:32:40,692][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:32:40,693][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:32:41,606][__main__][INFO] - Iteration 163 took 49s (27.24% Gen, 70.92% Train). Generation: 13s, Training: 35s. Estimated remaining time: 39h 12m 46s. Estimated total time: 41h 35m 46s. Time estimates for 10 more iterations: 8m 19s, 100 more iterations: 1h 23m 11s, 500 more iterations: 6h 55m 57s. [2025-11-13 00:32:41,608][__main__][INFO] - Starting iteration 163. [2025-11-13 00:32:42,075][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2025-11-13 00:32:42,076][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:32:45,704][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:32:57,402][__main__][INFO] - Number of regex retries in iteration 163: 1 [2025-11-13 00:32:57,403][__main__][INFO] - agents played in iteration 163 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:32:58,254][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:32:58,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:32:58,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:32:58,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:32:58,344][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:32:58,345][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:32:58,984][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:32:59,444][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:32:59,957][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:33:00,460][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:33:00,960][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:33:01,466][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:33:01,969][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:33:02,471][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:33:02,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:33:03,475][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:33:03,978][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:33:04,478][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:33:04,980][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:33:05,493][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:33:05,995][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:33:06,510][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:33:07,013][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:33:07,517][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:33:08,021][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:33:08,527][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:33:09,032][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:33:09,538][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:33:10,047][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:33:10,550][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:33:11,054][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:33:11,556][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:33:12,066][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:33:12,568][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:33:13,070][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:33:13,576][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:33:14,080][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:33:14,600][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:33:15,105][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:33:15,610][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:33:16,113][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:33:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:33:17,128][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:33:17,631][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:33:18,139][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:33:18,642][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:33:19,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:33:19,658][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:33:20,162][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:33:20,664][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:33:21,168][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:33:21,670][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:33:22,172][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:33:22,674][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:33:23,175][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:33:23,681][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:33:24,185][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:33:24,688][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:33:25,203][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:33:25,704][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:33:26,216][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:33:26,721][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:33:27,229][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:33:27,733][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:33:28,238][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:33:28,740][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:33:29,244][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:33:29,747][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:33:30,255][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:33:30,762][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:33:31,269][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10858 tokens. [2025-11-13 00:33:31,966][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:32 [2025-11-13 00:33:32,753][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:33:32,755][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:33:32,757][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:33:33,738][__main__][INFO] - Iteration 164 took 51s (29.67% Gen, 68.43% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 39m 20s. Estimated total time: 43h 3m 12s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 6s, 500 more iterations: 7h 10m 32s. [2025-11-13 00:33:33,740][__main__][INFO] - Starting iteration 164. [2025-11-13 00:33:34,233][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2025-11-13 00:33:34,234][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:33:39,404][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:33:47,946][__main__][INFO] - Number of regex retries in iteration 164: 1 [2025-11-13 00:33:47,947][__main__][INFO] - agents played in iteration 164 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:33:48,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:33:48,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:33:48,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:33:48,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:33:48,804][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:33:48,805][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:33:49,435][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:33:49,912][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:33:50,418][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:33:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:33:51,429][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:33:51,932][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:33:52,441][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:33:52,942][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:33:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:33:53,950][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:33:54,452][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:33:54,954][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:33:55,456][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:33:55,958][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:33:56,463][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:33:56,966][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:33:57,468][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:33:57,970][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:33:58,472][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:33:58,976][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:33:59,477][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:33:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:34:00,493][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:34:00,994][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:34:01,500][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:34:02,001][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:34:02,507][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:34:03,015][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:34:03,518][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:34:04,020][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:34:04,521][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:34:05,022][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:34:05,534][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:34:06,037][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:34:06,543][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:34:07,047][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:34:07,552][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:34:08,057][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:34:08,562][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:34:09,064][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:34:09,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:34:10,076][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:34:10,580][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:34:11,085][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:34:11,587][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:34:12,104][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:34:12,608][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:34:13,109][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:34:13,615][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:34:14,117][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:34:14,626][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:34:15,130][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:34:15,633][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:34:16,135][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:34:16,639][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:34:17,142][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:34:17,650][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:34:18,159][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:34:18,664][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:34:19,170][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:34:19,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:34:20,184][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:34:20,690][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:34:21,207][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:34:21,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10869 tokens. [2025-11-13 00:34:22,455][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 00:34:23,193][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:34:23,195][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:34:23,197][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:34:24,151][__main__][INFO] - Iteration 165 took 49s (27.47% Gen, 70.62% Train). Generation: 13s, Training: 35s. Estimated remaining time: 39h 11m 13s. Estimated total time: 41h 35m 55s. Time estimates for 10 more iterations: 8m 19s, 100 more iterations: 1h 23m 11s, 500 more iterations: 6h 55m 59s. [2025-11-13 00:34:24,153][__main__][INFO] - Starting iteration 165. [2025-11-13 00:34:24,640][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2025-11-13 00:34:24,641][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:34:40,255][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:34:41,005][__main__][INFO] - Number of regex retries in iteration 165: 1 [2025-11-13 00:34:41,006][__main__][INFO] - agents played in iteration 165 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:34:41,846][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:34:41,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:34:41,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:34:41,921][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:34:41,921][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:34:41,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:34:42,579][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:34:43,041][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:34:43,548][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:34:44,053][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:34:44,555][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:34:45,060][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:34:45,563][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:34:46,066][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:34:46,568][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:34:47,071][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:34:47,574][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:34:48,080][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:34:48,581][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:34:49,084][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:34:49,589][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:34:50,091][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:34:50,594][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:34:51,095][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:34:51,600][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:34:52,104][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:34:52,606][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:34:53,109][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:34:53,611][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:34:54,133][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:34:54,635][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:34:55,138][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:34:55,640][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:34:56,142][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:34:56,643][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:34:57,150][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:34:57,653][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:34:58,160][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:34:58,664][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:34:59,168][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:34:59,673][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:35:00,176][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:35:00,682][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:35:01,187][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:35:01,688][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:35:02,196][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:35:02,701][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:35:03,206][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:35:03,707][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:35:04,209][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:35:04,729][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:35:05,234][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:35:05,737][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:35:06,251][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:35:06,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:35:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:35:07,764][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:35:08,267][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:35:08,771][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:35:09,279][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:35:09,784][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:35:10,289][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:35:10,792][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:35:11,297][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:35:11,799][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:35:12,312][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:35:12,824][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:35:13,332][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:35:13,841][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:35:14,347][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:35:14,854][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10874 tokens. [2025-11-13 00:35:15,526][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:32 [2025-11-13 00:35:16,297][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:35:16,299][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:35:16,300][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:35:17,238][__main__][INFO] - Iteration 166 took 52s (31.11% Gen, 67.10% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 24m 19s. Estimated total time: 43h 49m 55s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 39s, 500 more iterations: 7h 18m 19s. [2025-11-13 00:35:17,240][__main__][INFO] - Starting iteration 166. [2025-11-13 00:35:17,727][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2025-11-13 00:35:17,728][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:35:21,332][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:35:26,587][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:35:29,370][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:35:31,287][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 11 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:35:32,750][__main__][INFO] - Number of regex retries in iteration 166: 4 [2025-11-13 00:35:32,750][__main__][INFO] - agents played in iteration 166 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:35:33,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:35:33,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:35:33,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:35:33,675][mllm.training.trainer_ad_align][INFO] - For task: 
Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:35:33,676][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:35:33,676][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 00:35:34,333][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:35:34,795][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:35:35,324][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:35:35,868][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:35:36,372][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:35:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:35:37,378][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:35:37,885][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:35:38,389][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:35:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:35:39,412][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:35:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:35:40,435][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:35:40,939][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:35:41,442][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:35:41,950][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:35:42,455][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 
00:35:42,962][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:35:43,464][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:35:43,967][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:35:44,472][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 00:35:44,975][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:35:45,477][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:35:45,979][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:35:46,481][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:35:46,984][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:35:47,487][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:35:47,994][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:35:48,499][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:35:49,001][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:35:49,504][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:35:50,007][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:35:50,509][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:35:51,028][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:35:51,535][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:35:52,040][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:35:52,546][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:35:53,051][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 
00:35:53,559][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:35:54,063][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:35:54,571][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:35:55,075][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 00:35:55,576][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:35:56,080][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:35:56,583][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:35:57,087][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:35:57,593][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:35:58,097][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:35:58,599][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:35:59,101][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:35:59,607][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:36:00,122][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:36:00,625][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:36:01,130][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:36:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:36:02,146][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:36:02,653][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:36:03,157][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:36:03,661][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 
00:36:04,170][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:36:04,675][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:36:05,179][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:36:05,684][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 00:36:06,192][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:36:06,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10854 tokens. [2025-11-13 00:36:07,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 00:36:08,147][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:36:08,149][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:36:08,151][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:36:09,106][__main__][INFO] - Iteration 167 took 51s (29.24% Gen, 68.90% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 22m 29s. Estimated total time: 42h 48m 56s. Time estimates for 10 more iterations: 8m 33s, 100 more iterations: 1h 25m 37s, 500 more iterations: 7h 8m 9s. [2025-11-13 00:36:09,108][__main__][INFO] - Starting iteration 167. 
[2025-11-13 00:36:09,606][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. [2025-11-13 00:36:09,607][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:36:24,532][__main__][INFO] - Number of regex retries in iteration 167: 0 [2025-11-13 00:36:24,533][__main__][INFO] - agents played in iteration 167 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:36:25,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:36:25,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:36:25,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:36:25,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:36:25,421][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:36:25,422][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:36:26,093][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:36:26,554][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:36:27,063][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:36:27,574][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:36:28,077][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:36:28,580][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:36:29,085][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:36:29,588][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:36:30,093][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:36:30,595][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:36:31,098][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:36:31,606][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:36:32,108][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:36:32,610][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:36:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:36:33,627][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:36:34,150][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:36:34,652][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:36:35,154][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:36:35,655][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:36:36,156][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:36:36,659][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:36:37,161][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:36:37,664][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:36:38,172][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:36:38,678][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:36:39,182][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:36:39,686][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:36:40,190][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:36:40,695][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:36:41,200][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:36:41,705][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:36:42,208][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:36:42,746][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:36:43,260][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:36:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:36:44,273][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:36:44,780][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:36:45,290][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:36:45,799][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:36:46,302][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:36:46,809][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:36:47,316][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:36:47,821][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:36:48,328][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:36:48,842][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:36:49,348][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:36:49,852][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:36:50,355][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:36:50,860][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:36:51,373][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:36:51,880][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:36:52,389][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:36:52,895][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:36:53,401][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:36:53,907][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:36:54,412][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:36:54,916][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:36:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:36:55,936][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:36:56,440][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:36:56,949][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:36:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:36:57,971][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:36:58,475][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10863 tokens. [2025-11-13 00:36:59,184][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 00:36:59,949][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:36:59,951][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:36:59,953][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:37:00,900][__main__][INFO] - Iteration 168 took 51s (29.10% Gen, 69.05% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 17m 27s. Estimated total time: 42h 44m 46s. Time estimates for 10 more iterations: 8m 32s, 100 more iterations: 1h 25m 29s, 500 more iterations: 7h 7m 27s. [2025-11-13 00:37:00,903][__main__][INFO] - Starting iteration 168. [2025-11-13 00:37:01,382][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2025-11-13 00:37:01,382][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:37:05,994][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:37:11,724][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given the per-item values, I recognize that Bob values hats very highly (10) compared to how I value them (1), while books and balls are valued similarly between us. This information leads me to propose taking all the hats, as this maximizes the likelihood of obtaining the highest value items for me. I will not propose any books or balls, as this would significantly reduce my expected value, given my lower valuation for these items compared to Bob's. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:37:15,862][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:37:17,802][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:37:19,821][__main__][INFO] - Number of regex retries in iteration 168: 4 [2025-11-13 00:37:19,822][__main__][INFO] - agents played in iteration 168 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:37:20,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:37:20,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with 
critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:37:20,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:37:20,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:37:20,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:37:20,759][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 00:37:21,408][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:37:21,878][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:37:22,388][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:37:22,891][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:37:23,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:37:23,907][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:37:24,413][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:37:24,916][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:37:25,420][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:37:25,921][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:37:26,426][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:37:26,933][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:37:27,445][mllm.training.trainer_common][INFO] 
- Processing mini-batch 12 of 64 [2025-11-13 00:37:27,950][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:37:28,459][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:37:28,963][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:37:29,468][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:37:29,975][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:37:30,477][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:37:30,980][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:37:31,483][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 00:37:31,986][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:37:32,490][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:37:32,994][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:37:33,497][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:37:34,002][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:37:34,507][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:37:35,015][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:37:35,521][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:37:36,031][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:37:36,543][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:37:37,049][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:37:37,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:37:38,064][mllm.training.trainer_common][INFO] - 
Processing mini-batch 33 of 64 [2025-11-13 00:37:38,571][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:37:39,080][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:37:39,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:37:40,089][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:37:40,591][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:37:41,095][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:37:41,598][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:37:42,099][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 00:37:42,603][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:37:43,108][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:37:43,612][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:37:44,134][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:37:44,644][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:37:45,153][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:37:45,669][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:37:46,176][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:37:46,683][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:37:47,192][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:37:47,697][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:37:48,205][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:37:48,710][mllm.training.trainer_common][INFO] - 
Processing mini-batch 54 of 64 [2025-11-13 00:37:49,215][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:37:49,727][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:37:50,232][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:37:50,758][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:37:51,263][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:37:51,776][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:37:52,282][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:37:52,787][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 00:37:53,294][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:37:53,803][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10805 tokens. 
[2025-11-13 00:37:54,488][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:33 [2025-11-13 00:37:55,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:37:55,273][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:37:55,275][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:37:56,286][__main__][INFO] - Iteration 169 took 54s (33.58% Gen, 64.57% Train). Generation: 18s, Training: 35s. Estimated remaining time: 43h 16m 59s. Estimated total time: 45h 45m 13s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 32s. [2025-11-13 00:37:56,288][__main__][INFO] - Starting iteration 169. [2025-11-13 00:37:56,754][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1. 
[2025-11-13 00:37:56,755][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:38:02,060][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:38:11,227][__main__][INFO] - Number of regex retries in iteration 169: 1 [2025-11-13 00:38:11,228][__main__][INFO] - agents played in iteration 169 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:38:12,042][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:38:12,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:38:12,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:38:12,117][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:38:12,118][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:38:12,119][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:38:12,770][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:38:13,243][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:38:13,753][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:38:14,257][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:38:14,768][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:38:15,273][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:38:15,783][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:38:16,286][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:38:16,790][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:38:17,297][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:38:17,805][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:38:18,309][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:38:18,815][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:38:19,319][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:38:19,826][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:38:20,329][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:38:20,832][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:38:21,349][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:38:21,852][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:38:22,363][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:38:22,865][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:38:23,370][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:38:23,880][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:38:24,383][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:38:24,887][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:38:25,390][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:38:25,895][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:38:26,402][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:38:26,907][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:38:27,412][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:38:27,917][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:38:28,422][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:38:28,930][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:38:29,434][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:38:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:38:30,454][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:38:30,959][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:38:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:38:31,979][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:38:32,485][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:38:32,992][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:38:33,496][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:38:34,040][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:38:34,553][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:38:35,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:38:35,574][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:38:36,085][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:38:36,590][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:38:37,096][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:38:37,600][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:38:38,103][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:38:38,607][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:38:39,112][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:38:39,617][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:38:40,120][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:38:40,623][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:38:41,137][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:38:41,641][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:38:42,159][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:38:42,664][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:38:43,171][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:38:43,680][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:38:44,186][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:38:44,693][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 00:38:45,200][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10861 tokens.
[2025-11-13 00:38:45,917][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.39%, ΔTime: 00:00:33
[2025-11-13 00:38:46,689][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 00:38:46,692][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 00:38:46,694][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 00:38:47,645][__main__][INFO] - Iteration 170 took 50s (28.44% Gen, 69.69% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 55m 27s. Estimated total time: 42h 24m 33s. Time estimates for 10 more iterations: 8m 28s, 100 more iterations: 1h 24m 49s, 500 more iterations: 7h 4m 5s.
[2025-11-13 00:38:47,647][__main__][INFO] - Starting iteration 170.
[2025-11-13 00:38:48,150][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 16 and human policies 1.
[2025-11-13 00:38:48,151][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 00:39:03,605][__main__][INFO] - Number of regex retries in iteration 170: 0
[2025-11-13 00:39:03,606][__main__][INFO] - agents played in iteration 170 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 00:39:04,388][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:39:04,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:39:04,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:39:04,464][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:39:04,464][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 00:39:04,466][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 00:39:05,076][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:39:05,551][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:39:06,063][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:39:06,571][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:39:07,076][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:39:07,583][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:39:08,089][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:39:08,591][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:39:09,094][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:39:09,599][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:39:10,102][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:39:10,608][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:39:11,112][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:39:11,615][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:39:12,118][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:39:12,619][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:39:13,120][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:39:13,626][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:39:14,129][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:39:14,634][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:39:15,134][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:39:15,636][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:39:16,167][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:39:16,669][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:39:17,174][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:39:17,678][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:39:18,179][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:39:18,689][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:39:19,192][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:39:19,692][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:39:20,195][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:39:20,699][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:39:21,208][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:39:21,733][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:39:22,243][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:39:22,756][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:39:23,261][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:39:23,780][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:39:24,284][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:39:24,789][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:39:25,291][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:39:25,801][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:39:26,310][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:39:26,816][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:39:27,323][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:39:27,831][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:39:28,335][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:39:28,848][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:39:29,356][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:39:29,862][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:39:30,371][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:39:30,874][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:39:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:39:31,882][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:39:32,384][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:39:32,891][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:39:33,395][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:39:33,898][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:39:34,401][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:39:34,906][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:39:35,411][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:39:35,914][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:39:36,418][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:39:36,925][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 00:39:37,431][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10861 tokens.
[2025-11-13 00:39:38,105][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33
[2025-11-13 00:39:38,861][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 00:39:38,863][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 00:39:38,864][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 00:39:40,727][__main__][INFO] - Iteration 171 took 52s (29.40% Gen, 67.06% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 18m 52s. Estimated total time: 43h 48m 51s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 37s, 500 more iterations: 7h 18m 8s.
[2025-11-13 00:39:40,729][__main__][INFO] - Starting iteration 171.
[2025-11-13 00:39:41,214][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 00:39:41,215][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 00:39:57,894][__main__][INFO] - Number of regex retries in iteration 171: 0
[2025-11-13 00:39:57,895][__main__][INFO] - agents played in iteration 171 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 00:39:58,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:39:58,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:39:58,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:39:58,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:39:58,837][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 00:39:58,837][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 00:39:59,539][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:40:00,002][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:40:00,515][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:40:01,017][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:40:01,520][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:40:02,026][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:40:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:40:03,038][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:40:03,542][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:40:04,048][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:40:04,563][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:40:05,064][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:40:05,580][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:40:06,082][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:40:06,583][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:40:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:40:07,585][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:40:08,089][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:40:08,593][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:40:09,097][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:40:09,605][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:40:10,110][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:40:10,617][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:40:11,123][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:40:11,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:40:12,139][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:40:12,644][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:40:13,151][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:40:13,665][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:40:14,173][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:40:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:40:15,190][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:40:15,696][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:40:16,204][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:40:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:40:17,212][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:40:17,719][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:40:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:40:18,732][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:40:19,236][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:40:19,743][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:40:20,259][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:40:20,763][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:40:21,281][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:40:21,784][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:40:22,292][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:40:22,802][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:40:23,303][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:40:23,806][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:40:24,306][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:40:24,809][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:40:25,312][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:40:25,815][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:40:26,316][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:40:26,818][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:40:27,320][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:40:27,826][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:40:28,331][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:40:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:40:29,337][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:40:29,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:40:30,339][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:40:30,841][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:40:31,344][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 00:40:31,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10857 tokens.
[2025-11-13 00:40:32,552][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33
[2025-11-13 00:40:33,320][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 00:40:33,322][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 00:40:33,323][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 00:40:34,284][__main__][INFO] - Iteration 172 took 53s (31.43% Gen, 66.76% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 42m 37s. Estimated total time: 44h 13m 29s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 26s, 500 more iterations: 7h 22m 14s.
[2025-11-13 00:40:34,286][__main__][INFO] - Starting iteration 172.
[2025-11-13 00:40:34,757][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 00:40:34,758][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 00:40:50,420][__main__][INFO] - Number of regex retries in iteration 172: 0
[2025-11-13 00:40:50,421][__main__][INFO] - agents played in iteration 172 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 00:40:51,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:40:51,232][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:40:51,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:40:51,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:40:51,278][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 00:40:51,279][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 00:40:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:40:52,390][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:40:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:40:53,410][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:40:53,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:40:54,418][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:40:54,922][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:40:55,425][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:40:55,929][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:40:56,438][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:40:56,946][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:40:57,451][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:40:57,966][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:40:58,476][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:40:58,978][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:40:59,481][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:40:59,986][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:41:00,491][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:41:00,999][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:41:01,509][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:41:02,020][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:41:02,529][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:41:03,035][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:41:03,542][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:41:04,048][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:41:04,553][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:41:05,073][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:41:05,579][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:41:06,094][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:41:06,599][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:41:07,102][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:41:07,606][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:41:08,112][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:41:08,617][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:41:09,126][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:41:09,654][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:41:10,162][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:41:10,670][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:41:11,179][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:41:11,688][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:41:12,196][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:41:12,712][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:41:13,217][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:41:13,724][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:41:14,230][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:41:14,733][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:41:15,240][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:41:15,743][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:41:16,249][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:41:16,755][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:41:17,260][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:41:17,764][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:41:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:41:18,773][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:41:19,293][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:41:19,799][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:41:20,309][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:41:20,813][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:41:21,316][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:41:21,824][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:41:22,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:41:22,830][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:41:23,335][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:41:23,837][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 00:41:24,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10849 tokens.
[2025-11-13 00:41:25,019][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.47%, ΔTime: 00:00:33
[2025-11-13 00:41:25,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 00:41:25,801][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 00:41:25,803][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 00:41:26,743][__main__][INFO] - Iteration 173 took 51s (30.13% Gen, 68.06% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 47m 33s. Estimated total time: 43h 19m 19s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 38s, 500 more iterations: 7h 13m 13s.
[2025-11-13 00:41:26,745][__main__][INFO] - Starting iteration 173.
[2025-11-13 00:41:27,235][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1.
[2025-11-13 00:41:27,235][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 00:41:33,738][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3
[2025-11-13 00:41:43,280][__main__][INFO] - Number of regex retries in iteration 173: 1
[2025-11-13 00:41:43,280][__main__][INFO] - agents played in iteration 173 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 00:41:44,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:41:44,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:41:44,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:41:44,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 00:41:44,129][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 00:41:44,131][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 00:41:44,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:41:45,196][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:41:45,705][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:41:46,209][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:41:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:41:47,224][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:41:47,730][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:41:48,236][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:41:48,738][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:41:49,244][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:41:49,749][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:41:50,255][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:41:50,767][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:41:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:41:51,777][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:41:52,282][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:41:52,786][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:41:53,313][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:41:53,819][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:41:54,323][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:41:54,827][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:41:55,332][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:41:55,839][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:41:56,344][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:41:56,851][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:41:57,358][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:41:57,864][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:41:58,371][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:41:58,874][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:41:59,378][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:41:59,905][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:42:00,411][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:42:00,919][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:42:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:42:01,927][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:42:02,432][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:42:02,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:42:03,441][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:42:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:42:04,452][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:42:04,961][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:42:05,470][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:42:05,977][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:42:06,485][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:42:06,990][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:42:07,496][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:42:08,001][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:42:08,507][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:42:09,022][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:42:09,547][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:42:10,053][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:42:10,557][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:42:11,062][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:42:11,571][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:42:12,072][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:42:12,576][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:42:13,092][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:42:13,596][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:42:14,109][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:42:14,612][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:42:15,116][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:42:15,619][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:42:16,123][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:42:16,629][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:42:17,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10872 tokens. [2025-11-13 00:42:17,815][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.52%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 00:42:18,605][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:42:18,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:42:18,608][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:42:19,554][__main__][INFO] - Iteration 174 took 52s (30.67% Gen, 67.52% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 3m 20s. Estimated total time: 43h 35m 58s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 11s, 500 more iterations: 7h 15m 59s. [2025-11-13 00:42:19,556][__main__][INFO] - Starting iteration 174. [2025-11-13 00:42:20,037][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. 
[2025-11-13 00:42:20,037][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:42:24,902][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:42:36,839][__main__][INFO] - Number of regex retries in iteration 174: 1 [2025-11-13 00:42:36,840][__main__][INFO] - agents played in iteration 174 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:42:37,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:42:37,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:42:37,754][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:42:37,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:42:37,777][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:42:37,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:42:38,396][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:42:38,859][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:42:39,371][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:42:39,876][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:42:40,386][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:42:40,892][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:42:41,396][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:42:41,920][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:42:42,429][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:42:42,936][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:42:43,443][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:42:43,946][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:42:44,454][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:42:44,959][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:42:45,465][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:42:45,971][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:42:46,476][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:42:46,988][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:42:47,495][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:42:47,999][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:42:48,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:42:49,018][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:42:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:42:50,019][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:42:50,524][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:42:51,036][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:42:51,541][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:42:52,046][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:42:52,550][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:42:53,055][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:42:53,566][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:42:54,071][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:42:54,578][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:42:55,085][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:42:55,590][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:42:56,095][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:42:56,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:42:57,102][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:42:57,622][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:42:58,132][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:42:58,641][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:42:59,144][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:42:59,652][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:43:00,163][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:43:00,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:43:01,176][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:43:01,685][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:43:02,189][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:43:02,697][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:43:03,201][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:43:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:43:04,208][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:43:04,710][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:43:05,213][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:43:05,719][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:43:06,222][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:43:06,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:43:07,238][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:43:07,736][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:43:08,238][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:43:08,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:43:09,240][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:43:09,743][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:43:10,248][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:43:10,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10837 tokens. [2025-11-13 00:43:11,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 00:43:12,159][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:43:12,160][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:43:12,162][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:43:13,178][__main__][INFO] - Iteration 175 took 53s (31.62% Gen, 66.47% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 43m 35s. Estimated total time: 44h 17m 7s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 34s, 500 more iterations: 7h 22m 51s. [2025-11-13 00:43:13,181][__main__][INFO] - Starting iteration 175. [2025-11-13 00:43:13,646][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. 
[2025-11-13 00:43:13,647][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:43:19,443][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:43:29,346][__main__][INFO] - Number of regex retries in iteration 175: 1 [2025-11-13 00:43:29,347][__main__][INFO] - agents played in iteration 175 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:43:30,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:43:30,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:43:30,194][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:43:30,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:43:30,217][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:43:30,218][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:43:30,860][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:43:31,318][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:43:31,841][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:43:32,345][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:43:32,852][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:43:33,358][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:43:33,862][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:43:34,370][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:43:34,876][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:43:35,379][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:43:35,886][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:43:36,391][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:43:36,902][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:43:37,406][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:43:37,910][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:43:38,417][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:43:38,922][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:43:39,430][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:43:39,942][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:43:40,448][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:43:40,958][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:43:41,466][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:43:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:43:42,478][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:43:42,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:43:43,493][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:43:43,997][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:43:44,504][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:43:45,014][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:43:45,516][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:43:46,029][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:43:46,533][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:43:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:43:47,545][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:43:48,052][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:43:48,556][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:43:49,062][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:43:49,564][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:43:50,071][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:43:50,577][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:43:51,083][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:43:51,594][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:43:52,098][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:43:52,607][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:43:53,110][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:43:53,612][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:43:54,116][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:43:54,617][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:43:55,119][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:43:55,641][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:43:56,144][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:43:56,650][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:43:57,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:43:57,658][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:43:58,162][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:43:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:43:59,174][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:43:59,678][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:44:00,181][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:44:00,682][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:44:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:44:01,692][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:44:02,196][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:44:02,697][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:44:03,199][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10860 tokens. [2025-11-13 00:44:03,864][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 00:44:04,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:44:04,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:44:04,639][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:44:05,577][__main__][INFO] - Iteration 176 took 51s (30.23% Gen, 67.96% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 42m 11s. Estimated total time: 43h 16m 34s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 33s, 500 more iterations: 7h 12m 45s. [2025-11-13 00:44:05,579][__main__][INFO] - Starting iteration 176. [2025-11-13 00:44:06,080][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. 
[2025-11-13 00:44:06,081][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:44:14,280][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:44:21,017][__main__][INFO] - Number of regex retries in iteration 176: 1 [2025-11-13 00:44:21,018][__main__][INFO] - agents played in iteration 176 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:44:21,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:44:21,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:44:21,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:44:21,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:44:21,867][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:44:21,868][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:44:22,482][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:44:22,943][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:44:23,449][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:44:23,955][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:44:24,461][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:44:24,970][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:44:25,476][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:44:25,981][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:44:26,483][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:44:26,989][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:44:27,493][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:44:27,999][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:44:28,503][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:44:29,008][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:44:29,512][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:44:30,024][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:44:30,526][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:44:31,046][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:44:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:44:32,055][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:44:32,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:44:33,066][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:44:33,573][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:44:34,076][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:44:34,581][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:44:35,090][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:44:35,593][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:44:36,099][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:44:36,602][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:44:37,108][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:44:37,634][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:44:38,139][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:44:38,648][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:44:39,151][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:44:39,658][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:44:40,170][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:44:40,677][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:44:41,185][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:44:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:44:42,194][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:44:42,699][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:44:43,204][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:44:43,710][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:44:44,216][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:44:44,718][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:44:45,222][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:44:45,725][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:44:46,235][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:44:46,754][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:44:47,260][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:44:47,763][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:44:48,268][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:44:48,773][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:44:49,281][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:44:49,787][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:44:50,289][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:44:50,795][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:44:51,297][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:44:51,798][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:44:52,300][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:44:52,801][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:44:53,311][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:44:53,814][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:44:54,317][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:44:54,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10856 tokens. [2025-11-13 00:44:55,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 00:44:56,282][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:44:56,284][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:44:56,286][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:44:57,275][__main__][INFO] - Iteration 177 took 51s (29.18% Gen, 68.89% Train). Generation: 14s, Training: 35s. Estimated remaining time: 40h 4m 32s. Estimated total time: 42h 39m 47s. Time estimates for 10 more iterations: 8m 31s, 100 more iterations: 1h 25m 19s, 500 more iterations: 7h 6m 37s. [2025-11-13 00:44:57,278][__main__][INFO] - Starting iteration 177. [2025-11-13 00:44:57,766][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. 
[2025-11-13 00:44:57,766][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:45:13,705][__main__][INFO] - Number of regex retries in iteration 177: 0 [2025-11-13 00:45:13,705][__main__][INFO] - agents played in iteration 177 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:45:14,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:45:14,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:45:14,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:45:14,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:45:14,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:45:14,641][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:45:15,299][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:45:15,775][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:45:16,286][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:45:16,793][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:45:17,300][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:45:17,806][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:45:18,330][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:45:18,838][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:45:19,343][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:45:19,850][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:45:20,356][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:45:20,864][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:45:21,371][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:45:21,879][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:45:22,386][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:45:22,891][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:45:23,399][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:45:23,905][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:45:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:45:24,930][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:45:25,435][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:45:25,941][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:45:26,443][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:45:26,951][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:45:27,460][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:45:27,966][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:45:28,470][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:45:28,977][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:45:29,480][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:45:29,984][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:45:30,485][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:45:30,988][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:45:31,493][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:45:32,004][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:45:32,513][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:45:33,019][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:45:33,526][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:45:34,044][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:45:34,550][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:45:35,056][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:45:35,561][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:45:36,068][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:45:36,574][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:45:37,080][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:45:37,584][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:45:38,088][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:45:38,591][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:45:39,098][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:45:39,606][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:45:40,111][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:45:40,613][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:45:41,114][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:45:41,614][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:45:42,125][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:45:42,626][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:45:43,147][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:45:43,649][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:45:44,152][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:45:44,655][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:45:45,159][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:45:45,661][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:45:46,164][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:45:46,669][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:45:47,174][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:45:47,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10863 tokens. [2025-11-13 00:45:48,334][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.29%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 00:45:49,110][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:45:49,111][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:45:49,113][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:45:50,036][__main__][INFO] - Iteration 178 took 52s (30.49% Gen, 67.74% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 57m 23s. Estimated total time: 43h 33m 32s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 35s. [2025-11-13 00:45:50,038][__main__][INFO] - Starting iteration 178. [2025-11-13 00:45:50,506][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. 
[2025-11-13 00:45:50,507][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:45:56,195][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:46:05,836][__main__][INFO] - Number of regex retries in iteration 178: 1 [2025-11-13 00:46:05,837][__main__][INFO] - agents played in iteration 178 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:46:06,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:46:06,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:46:06,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:46:06,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:46:06,686][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:46:06,687][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:46:07,368][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:46:07,832][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:46:08,353][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:46:08,859][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:46:09,367][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:46:09,878][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:46:10,390][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:46:10,898][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:46:11,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:46:11,911][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:46:12,418][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:46:12,924][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:46:13,432][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:46:13,950][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:46:14,459][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:46:14,973][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:46:15,479][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:46:15,984][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:46:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:46:16,999][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:46:17,504][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:46:18,009][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:46:18,517][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:46:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:46:19,532][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:46:20,036][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:46:20,541][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:46:21,048][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:46:21,560][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:46:22,064][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:46:22,568][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:46:23,077][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:46:23,583][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:46:24,088][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:46:24,592][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:46:25,094][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:46:25,598][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:46:26,101][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:46:26,605][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:46:27,112][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:46:27,614][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:46:28,119][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:46:28,622][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:46:29,127][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:46:29,636][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:46:30,141][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:46:30,649][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:46:31,154][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:46:31,657][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:46:32,172][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:46:32,678][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:46:33,180][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:46:33,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:46:34,183][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:46:34,684][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:46:35,185][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:46:35,685][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:46:36,189][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:46:36,688][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:46:37,188][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:46:37,690][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:46:38,191][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:46:38,695][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:46:39,196][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:46:39,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10867 tokens. [2025-11-13 00:46:40,371][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 00:46:41,138][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:46:41,139][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:46:41,141][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:46:42,083][__main__][INFO] - Iteration 179 took 51s (29.72% Gen, 68.45% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 21m 52s. Estimated total time: 42h 58m 53s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 57s, 500 more iterations: 7h 9m 48s. [2025-11-13 00:46:42,086][__main__][INFO] - Starting iteration 179. [2025-11-13 00:46:42,569][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. 
[2025-11-13 00:46:42,570][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:46:47,624][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:46:57,792][__main__][INFO] - Number of regex retries in iteration 179: 1 [2025-11-13 00:46:57,793][__main__][INFO] - agents played in iteration 179 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:46:58,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:46:58,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:46:58,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:46:58,675][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:46:58,676][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:46:58,677][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:46:59,329][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:46:59,795][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:47:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:47:00,811][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:47:01,321][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:47:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:47:02,332][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:47:02,851][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:47:03,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:47:03,879][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:47:04,386][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:47:04,893][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:47:05,405][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:47:05,912][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:47:06,419][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:47:06,925][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:47:07,432][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:47:07,956][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:47:08,463][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:47:08,974][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:47:09,481][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:47:09,989][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:47:10,498][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:47:11,007][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:47:11,514][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:47:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:47:12,524][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:47:13,026][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:47:13,532][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:47:14,036][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:47:14,555][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:47:15,059][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:47:15,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:47:16,068][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:47:16,575][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:47:17,083][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:47:17,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:47:18,094][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:47:18,598][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:47:19,103][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:47:19,608][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:47:20,119][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:47:20,625][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:47:21,138][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:47:21,649][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:47:22,159][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:47:22,667][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:47:23,174][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:47:23,683][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:47:24,189][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:47:24,694][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:47:25,196][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:47:25,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:47:26,196][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:47:26,697][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:47:27,200][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:47:27,700][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:47:28,202][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:47:28,706][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:47:29,209][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:47:29,712][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:47:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:47:30,718][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:47:31,220][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:47:31,722][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10856 tokens. [2025-11-13 00:47:32,396][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 00:47:33,175][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:47:33,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:47:33,179][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:47:34,194][__main__][INFO] - Iteration 180 took 51s (29.49% Gen, 68.54% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 23m 24s. Estimated total time: 43h 1m 16s. Time estimates for 10 more iterations: 8m 36s, 100 more iterations: 1h 26m 2s, 500 more iterations: 7h 10m 12s. [2025-11-13 00:47:34,196][__main__][INFO] - Starting iteration 180. [2025-11-13 00:47:34,700][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 17 and human policies 1. 
[2025-11-13 00:47:34,701][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:47:38,782][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:47:40,620][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:47:43,007][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:47:50,562][__main__][INFO] - Number of regex retries in iteration 180: 3 [2025-11-13 00:47:50,563][__main__][INFO] - agents played in iteration 180 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:47:51,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:47:51,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:47:51,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:47:51,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:47:51,495][mllm.training.trainer_ad_align][INFO] - 
Sharing advantage alignment data. [2025-11-13 00:47:51,496][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 00:47:52,163][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:47:52,625][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:47:53,133][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:47:53,646][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:47:54,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:47:54,655][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:47:55,164][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:47:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:47:56,183][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:47:56,690][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:47:57,196][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:47:57,702][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:47:58,207][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:47:58,711][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:47:59,216][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:47:59,721][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:48:00,226][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:48:00,736][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:48:01,242][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:48:01,763][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-13 00:48:02,269][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 00:48:02,786][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:48:03,291][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:48:03,796][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:48:04,301][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:48:04,805][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:48:05,310][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:48:05,815][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:48:06,317][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:48:06,824][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:48:07,330][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:48:07,835][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:48:08,340][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:48:08,847][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:48:09,369][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:48:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:48:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:48:10,890][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:48:11,394][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:48:11,904][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:48:12,410][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-13 00:48:12,916][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 00:48:13,420][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:48:13,923][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:48:14,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:48:14,939][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:48:15,444][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:48:15,948][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:48:16,451][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:48:16,961][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:48:17,463][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:48:17,969][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:48:18,475][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:48:18,974][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:48:19,475][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:48:19,977][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:48:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:48:20,993][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:48:21,494][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:48:21,996][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:48:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:48:22,997][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-13 00:48:23,502][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 00:48:24,001][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:48:24,504][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10878 tokens. [2025-11-13 00:48:25,154][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:32 [2025-11-13 00:48:25,921][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:48:25,923][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:48:25,924][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:48:27,659][__main__][INFO] - Iteration 181 took 52s (29.95% Gen, 66.77% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 29m 11s. Estimated total time: 44h 7m 57s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 15s, 500 more iterations: 7h 21m 19s. [2025-11-13 00:48:27,661][__main__][INFO] - Starting iteration 181. [2025-11-13 00:48:28,134][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 00:48:28,134][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:48:43,014][__main__][INFO] - Number of regex retries in iteration 181: 0 [2025-11-13 00:48:43,015][__main__][INFO] - agents played in iteration 181 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:48:43,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:48:43,835][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:48:43,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:48:43,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:48:43,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:48:43,881][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:48:44,540][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:48:45,003][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:48:45,513][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:48:46,020][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:48:46,526][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:48:47,032][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:48:47,539][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:48:48,045][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:48:48,555][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:48:49,059][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:48:49,565][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:48:50,073][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:48:50,578][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:48:51,086][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:48:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:48:52,099][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:48:52,607][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:48:53,112][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:48:53,617][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:48:54,122][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:48:54,628][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:48:55,136][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:48:55,640][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:48:56,145][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:48:56,651][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:48:57,156][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:48:57,682][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:48:58,187][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:48:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:48:59,201][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:48:59,709][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:49:00,213][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:49:00,716][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:49:01,219][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:49:01,724][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:49:02,229][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:49:02,733][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:49:03,239][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:49:03,741][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:49:04,246][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:49:04,752][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:49:05,258][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:49:05,776][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:49:06,281][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:49:06,793][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:49:07,297][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:49:07,801][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:49:08,304][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:49:08,809][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:49:09,313][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:49:09,818][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:49:10,325][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:49:10,829][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:49:11,333][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:49:11,835][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:49:12,342][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:49:12,846][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:49:13,351][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:49:13,854][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:49:14,358][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:49:14,872][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:49:15,373][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:49:15,875][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:49:16,380][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:49:16,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10855 tokens. [2025-11-13 00:49:17,543][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 00:49:18,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:49:18,309][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:49:18,311][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:49:19,257][__main__][INFO] - Iteration 182 took 51s (29.11% Gen, 69.04% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 56m 32s. Estimated total time: 42h 36m 10s. Time estimates for 10 more iterations: 8m 31s, 100 more iterations: 1h 25m 12s, 500 more iterations: 7h 6m 1s. [2025-11-13 00:49:19,259][__main__][INFO] - Starting iteration 182. [2025-11-13 00:49:19,744][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 00:49:19,744][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:49:28,372][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:49:36,419][__main__][INFO] - Number of regex retries in iteration 182: 1 [2025-11-13 00:49:36,420][__main__][INFO] - agents played in iteration 182 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:49:37,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:49:37,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:49:37,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:49:37,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:49:37,287][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:49:37,288][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:49:37,917][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:49:38,392][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:49:38,905][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:49:39,410][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:49:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:49:40,420][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:49:40,928][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:49:41,434][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:49:41,940][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:49:42,448][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:49:42,957][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:49:43,462][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:49:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:49:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:49:45,009][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:49:45,514][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:49:46,021][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:49:46,525][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:49:47,030][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:49:47,534][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:49:48,041][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:49:48,546][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:49:49,052][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:49:49,557][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:49:50,060][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:49:50,569][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:49:51,075][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:49:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:49:52,095][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:49:52,599][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:49:53,104][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:49:53,607][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:49:54,110][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:49:54,616][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:49:55,120][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:49:55,632][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:49:56,139][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:49:56,643][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:49:57,150][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:49:57,655][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:49:58,165][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:49:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:49:59,172][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:49:59,680][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:50:00,183][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:50:00,686][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:50:01,189][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:50:01,692][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:50:02,203][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:50:02,710][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:50:03,215][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:50:03,722][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:50:04,224][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:50:04,735][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:50:05,237][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:50:05,737][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:50:06,244][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:50:06,747][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:50:07,251][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:50:07,754][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:50:08,257][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:50:08,762][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:50:09,265][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:50:09,768][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:50:10,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10846 tokens. [2025-11-13 00:50:10,923][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 00:50:11,710][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:50:11,712][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:50:11,714][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:50:12,664][__main__][INFO] - Iteration 183 took 52s (31.51% Gen, 66.69% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 25m 31s. Estimated total time: 44h 6m 2s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 12s, 500 more iterations: 7h 21m 0s. [2025-11-13 00:50:12,666][__main__][INFO] - Starting iteration 183. [2025-11-13 00:50:13,147][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 00:50:13,148][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:50:17,750][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:50:22,828][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:50:29,020][__main__][INFO] - Number of regex retries in iteration 183: 2 [2025-11-13 00:50:29,021][__main__][INFO] - agents played in iteration 183 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:50:29,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:50:29,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:50:29,911][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:50:29,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:50:29,937][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:50:29,938][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:50:30,589][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:50:31,054][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:50:31,562][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:50:32,066][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:50:32,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:50:33,076][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:50:33,578][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:50:34,084][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:50:34,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:50:35,096][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:50:35,612][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:50:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:50:36,645][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:50:37,154][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:50:37,661][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:50:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:50:38,679][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:50:39,189][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:50:39,695][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:50:40,200][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:50:40,718][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:50:41,224][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:50:41,736][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:50:42,243][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:50:42,750][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:50:43,259][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:50:43,766][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:50:44,273][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:50:44,781][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:50:45,286][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:50:45,797][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:50:46,301][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:50:46,802][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:50:47,318][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:50:47,820][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:50:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:50:48,841][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:50:49,346][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:50:49,855][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:50:50,361][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:50:50,867][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:50:51,371][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:50:51,873][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:50:52,376][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:50:52,881][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:50:53,385][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:50:53,884][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:50:54,387][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:50:54,888][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:50:55,391][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:50:55,894][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:50:56,399][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:50:56,904][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:50:57,422][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:50:57,925][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:50:58,426][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:50:58,931][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:50:59,434][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:50:59,942][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:51:00,445][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:51:00,946][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:51:01,447][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:51:01,949][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:51:02,453][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:51:02,954][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10852 tokens. [2025-11-13 00:51:03,601][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 00:51:04,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:51:04,376][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:51:04,377][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:51:05,291][__main__][INFO] - Iteration 184 took 52s (30.44% Gen, 67.80% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 45m 51s. Estimated total time: 43h 27m 14s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 54s, 500 more iterations: 7h 14m 32s. [2025-11-13 00:51:05,293][__main__][INFO] - Starting iteration 184. [2025-11-13 00:51:05,761][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 00:51:05,761][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:51:20,334][__main__][INFO] - Number of regex retries in iteration 184: 0 [2025-11-13 00:51:20,335][__main__][INFO] - agents played in iteration 184 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:51:21,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:51:21,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:51:21,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:51:21,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:51:21,215][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:51:21,216][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:51:21,893][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:51:22,366][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:51:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:51:23,382][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:51:23,885][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:51:24,393][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:51:24,899][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:51:25,404][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:51:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:51:26,424][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:51:26,933][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:51:27,444][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:51:27,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:51:28,456][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:51:28,965][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:51:29,469][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:51:29,974][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:51:30,479][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:51:30,983][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:51:31,486][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:51:31,993][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:51:32,497][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:51:33,011][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:51:33,518][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:51:34,040][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:51:34,547][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:51:35,052][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:51:35,558][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:51:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:51:36,570][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:51:37,075][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:51:37,578][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:51:38,086][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:51:38,591][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:51:39,096][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:51:39,600][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:51:40,106][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:51:40,619][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:51:41,121][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:51:41,621][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:51:42,124][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:51:42,630][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:51:43,133][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:51:43,637][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:51:44,141][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:51:44,644][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:51:45,150][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:51:45,653][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:51:46,156][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:51:46,660][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:51:47,171][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:51:47,676][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:51:48,179][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:51:48,684][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:51:49,187][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:51:49,693][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:51:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:51:50,703][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:51:51,206][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:51:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:51:52,211][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:51:52,715][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:51:53,216][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:51:53,723][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:51:54,225][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10860 tokens. [2025-11-13 00:51:54,888][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 00:51:55,662][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:51:55,664][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:51:55,665][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:51:56,602][__main__][INFO] - Iteration 185 took 50s (28.67% Gen, 69.49% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 39m 52s. Estimated total time: 42h 22m 6s. Time estimates for 10 more iterations: 8m 28s, 100 more iterations: 1h 24m 44s, 500 more iterations: 7h 3m 41s. [2025-11-13 00:51:56,605][__main__][INFO] - Starting iteration 185. [2025-11-13 00:51:57,111][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 00:51:57,112][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:52:01,798][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:52:13,853][__main__][INFO] - Number of regex retries in iteration 185: 1 [2025-11-13 00:52:13,853][__main__][INFO] - agents played in iteration 185 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:52:14,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:52:14,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:52:14,706][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:52:14,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:52:14,729][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:52:14,730][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:52:15,407][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:52:15,871][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:52:16,377][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:52:16,881][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:52:17,389][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:52:17,893][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:52:18,395][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:52:18,902][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:52:19,403][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:52:19,910][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:52:20,415][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:52:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:52:21,430][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:52:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:52:22,440][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:52:22,955][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:52:23,462][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:52:23,966][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:52:24,473][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:52:24,977][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:52:25,481][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:52:25,986][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:52:26,493][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:52:26,996][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:52:27,499][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:52:28,005][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:52:28,512][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:52:29,018][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:52:29,522][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:52:30,023][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:52:30,528][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:52:31,031][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:52:31,534][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:52:32,050][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:52:32,555][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:52:33,058][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:52:33,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:52:34,066][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:52:34,580][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:52:35,084][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:52:35,585][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:52:36,089][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:52:36,592][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:52:37,095][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:52:37,599][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:52:38,102][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:52:38,605][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:52:39,106][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:52:39,612][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:52:40,115][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:52:40,616][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:52:41,126][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:52:41,632][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:52:42,140][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:52:42,653][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:52:43,155][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:52:43,668][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:52:44,172][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:52:44,674][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:52:45,178][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:52:45,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:52:46,186][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:52:46,689][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:52:47,192][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:52:47,697][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10862 tokens. [2025-11-13 00:52:48,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:32 [2025-11-13 00:52:49,144][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:52:49,146][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:52:49,148][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:52:50,078][__main__][INFO] - Iteration 186 took 52s (31.61% Gen, 66.63% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 25m 14s. Estimated total time: 44h 8m 22s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 16s, 500 more iterations: 7h 21m 23s. [2025-11-13 00:52:50,080][__main__][INFO] - Starting iteration 186. [2025-11-13 00:52:50,563][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 00:52:50,564][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:53:05,570][__main__][INFO] - Number of regex retries in iteration 186: 0 [2025-11-13 00:53:05,571][__main__][INFO] - agents played in iteration 186 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:53:06,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:53:06,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:53:06,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:53:06,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:53:06,476][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:53:06,476][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:53:07,155][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:53:07,615][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:53:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:53:08,629][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:53:09,134][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:53:09,637][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:53:10,152][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:53:10,657][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:53:11,160][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:53:11,668][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:53:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:53:12,677][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:53:13,182][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:53:13,690][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:53:14,193][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:53:14,700][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:53:15,209][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:53:15,715][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:53:16,220][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:53:16,725][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:53:17,231][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:53:17,736][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:53:18,242][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:53:18,746][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:53:19,252][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:53:19,756][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:53:20,261][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:53:20,773][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:53:21,278][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:53:21,791][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:53:22,297][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:53:22,801][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:53:23,304][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:53:23,806][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:53:24,307][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:53:24,809][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:53:25,313][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:53:25,817][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:53:26,319][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:53:26,821][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:53:27,325][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:53:27,824][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:53:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:53:28,831][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:53:29,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:53:29,836][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:53:30,338][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:53:30,837][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:53:31,340][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:53:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:53:32,346][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:53:32,846][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:53:33,347][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:53:33,862][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:53:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:53:34,886][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:53:35,386][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:53:35,887][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:53:36,397][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:53:36,897][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:53:37,402][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:53:37,902][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:53:38,403][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:53:38,906][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:53:39,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10857 tokens. [2025-11-13 00:53:40,071][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:32 [2025-11-13 00:53:40,827][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:53:40,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:53:40,830][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:53:41,762][__main__][INFO] - Iteration 187 took 51s (29.31% Gen, 68.86% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 55m 59s. Estimated total time: 42h 39m 59s. Time estimates for 10 more iterations: 8m 31s, 100 more iterations: 1h 25m 19s, 500 more iterations: 7h 6m 39s. [2025-11-13 00:53:41,764][__main__][INFO] - Starting iteration 187. [2025-11-13 00:53:42,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 00:53:42,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:53:58,234][__main__][INFO] - Number of regex retries in iteration 187: 0 [2025-11-13 00:53:58,235][__main__][INFO] - agents played in iteration 187 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:53:59,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:53:59,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:53:59,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:53:59,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:53:59,101][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:53:59,101][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:53:59,762][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:54:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:54:00,730][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:54:01,233][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:54:01,736][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:54:02,246][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:54:02,753][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:54:03,257][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:54:03,769][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:54:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:54:04,789][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:54:05,295][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:54:05,799][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:54:06,306][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:54:06,809][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:54:07,315][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:54:07,818][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:54:08,324][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:54:08,830][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:54:09,341][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:54:09,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:54:10,351][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:54:10,859][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:54:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:54:11,894][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:54:12,399][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:54:12,908][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:54:13,413][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:54:13,918][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:54:14,421][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:54:14,924][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:54:15,426][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:54:15,933][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:54:16,435][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:54:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:54:17,438][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:54:17,941][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:54:18,443][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:54:18,945][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:54:19,447][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:54:19,948][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:54:20,454][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:54:20,957][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:54:21,459][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:54:21,975][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:54:22,478][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:54:22,998][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:54:23,499][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:54:23,999][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:54:24,503][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:54:25,006][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:54:25,512][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:54:26,015][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:54:26,517][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:54:27,022][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:54:27,527][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:54:28,029][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:54:28,532][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:54:29,034][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:54:29,538][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:54:30,039][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:54:30,542][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:54:31,045][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:54:31,548][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:54:32,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10830 tokens. [2025-11-13 00:54:32,709][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:32 [2025-11-13 00:54:33,480][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:54:33,481][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:54:33,483][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:54:34,405][__main__][INFO] - Iteration 188 took 52s (30.67% Gen, 67.56% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 43m 47s. Estimated total time: 43h 28m 39s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 57s, 500 more iterations: 7h 14m 46s. [2025-11-13 00:54:34,407][__main__][INFO] - Starting iteration 188. [2025-11-13 00:54:34,905][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 00:54:34,906][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:54:41,389][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:54:51,009][__main__][INFO] - Number of regex retries in iteration 188: 1 [2025-11-13 00:54:51,010][__main__][INFO] - agents played in iteration 188 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:54:51,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:54:51,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:54:51,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:54:51,888][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:54:51,889][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:54:51,889][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:54:52,530][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:54:52,991][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:54:53,503][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:54:54,009][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:54:54,516][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:54:55,020][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:54:55,523][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:54:56,030][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:54:56,537][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:54:57,041][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:54:57,544][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:54:58,049][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:54:58,553][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:54:59,056][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:54:59,560][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:55:00,072][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:55:00,580][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:55:01,101][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:55:01,608][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:55:02,115][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:55:02,618][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:55:03,120][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:55:03,625][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:55:04,129][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:55:04,635][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:55:05,145][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:55:05,650][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:55:06,153][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:55:06,657][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:55:07,159][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:55:07,663][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:55:08,166][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:55:08,670][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:55:09,173][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:55:09,676][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:55:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:55:10,681][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:55:11,184][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:55:11,707][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:55:12,211][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:55:12,712][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:55:13,212][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:55:13,714][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:55:14,218][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:55:14,720][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:55:15,221][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:55:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:55:16,232][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:55:16,737][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:55:17,241][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:55:17,744][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:55:18,251][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:55:18,754][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:55:19,259][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:55:19,764][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:55:20,270][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:55:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:55:21,278][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:55:21,779][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:55:22,282][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:55:22,784][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:55:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:55:23,795][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:55:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:55:24,809][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10847 tokens. [2025-11-13 00:55:25,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:32 [2025-11-13 00:55:26,242][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:55:26,244][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:55:26,245][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:55:27,176][__main__][INFO] - Iteration 189 took 52s (30.81% Gen, 67.41% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 47m 47s. Estimated total time: 43h 33m 33s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 35s. [2025-11-13 00:55:27,178][__main__][INFO] - Starting iteration 189. [2025-11-13 00:55:27,662][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 00:55:27,662][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:55:40,161][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:55:42,812][__main__][INFO] - Number of regex retries in iteration 189: 1 [2025-11-13 00:55:42,813][__main__][INFO] - agents played in iteration 189 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:55:43,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:55:43,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:55:43,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:55:43,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:55:43,711][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:55:43,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:55:44,367][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:55:44,832][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:55:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:55:45,848][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:55:46,353][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:55:46,857][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:55:47,365][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:55:47,872][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:55:48,384][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:55:48,892][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:55:49,397][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:55:49,902][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:55:50,405][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:55:50,911][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:55:51,418][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:55:51,923][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:55:52,429][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:55:52,936][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:55:53,443][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:55:53,952][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:55:54,458][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:55:54,960][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:55:55,462][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:55:55,965][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:55:56,469][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:55:56,973][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:55:57,475][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:55:57,977][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:55:58,480][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:55:59,005][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:55:59,508][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:56:00,009][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:56:00,513][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:56:01,014][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:56:01,516][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:56:02,019][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:56:02,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:56:03,023][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:56:03,527][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:56:04,031][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:56:04,532][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:56:05,034][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:56:05,537][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:56:06,039][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:56:06,540][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:56:07,043][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:56:07,544][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:56:08,046][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:56:08,546][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:56:09,051][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:56:09,557][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:56:10,060][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:56:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:56:11,075][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:56:11,579][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:56:12,108][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:56:12,613][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:56:13,114][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:56:13,614][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:56:14,116][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:56:14,621][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:56:15,120][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:56:15,620][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:56:16,122][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:56:16,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10841 tokens. [2025-11-13 00:56:17,276][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:32 [2025-11-13 00:56:18,044][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:56:18,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:56:18,047][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:56:18,973][__main__][INFO] - Iteration 190 took 51s (29.53% Gen, 68.67% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 58m 58s. Estimated total time: 42h 45m 36s. Time estimates for 10 more iterations: 8m 33s, 100 more iterations: 1h 25m 31s, 500 more iterations: 7h 7m 36s. [2025-11-13 00:56:18,975][__main__][INFO] - Starting iteration 190. [2025-11-13 00:56:19,439][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 18 and human policies 1. 
[2025-11-13 00:56:19,439][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:56:32,691][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:56:35,195][__main__][INFO] - Number of regex retries in iteration 190: 1 [2025-11-13 00:56:35,196][__main__][INFO] - agents played in iteration 190 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:56:36,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:56:36,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:56:36,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:56:36,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:56:36,100][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:56:36,101][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:56:36,770][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:56:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:56:37,742][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:56:38,250][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:56:38,753][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:56:39,263][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:56:39,770][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:56:40,276][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:56:40,779][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:56:41,286][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:56:41,790][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:56:42,304][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:56:42,809][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:56:43,313][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:56:43,819][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:56:44,325][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:56:44,833][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:56:45,340][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:56:45,848][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:56:46,356][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:56:46,860][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:56:47,367][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:56:47,873][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:56:48,379][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:56:48,884][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:56:49,394][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:56:49,898][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:56:50,402][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:56:50,908][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:56:51,440][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:56:51,942][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:56:52,444][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:56:52,947][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:56:53,450][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:56:53,953][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:56:54,455][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:56:54,957][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:56:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:56:55,961][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:56:56,463][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:56:56,965][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:56:57,470][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:56:57,974][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:56:58,478][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:56:58,984][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:56:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:56:59,997][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:57:00,518][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:57:01,023][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:57:01,531][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:57:02,041][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:57:02,550][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:57:03,057][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:57:03,560][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:57:04,063][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:57:04,571][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:57:05,076][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:57:05,582][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:57:06,085][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:57:06,591][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:57:07,109][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:57:07,612][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:57:08,114][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:57:08,616][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:57:09,118][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10846 tokens. [2025-11-13 00:57:09,772][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.29%, Current % of VRAM taken: 58.54%, Block Peak % of device VRAM: 62.39%, ΔTime: 00:00:33 [2025-11-13 00:57:10,529][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:57:10,530][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:57:10,532][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:57:12,338][__main__][INFO] - Iteration 191 took 52s (29.79% Gen, 66.80% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 17m 29s. Estimated total time: 44h 5m 0s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 10s, 500 more iterations: 7h 20m 50s. [2025-11-13 00:57:12,340][__main__][INFO] - Starting iteration 191. [2025-11-13 00:57:12,828][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 00:57:12,828][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:57:17,856][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:57:22,124][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:57:29,023][__main__][INFO] - Number of regex retries in iteration 191: 2 [2025-11-13 00:57:29,024][__main__][INFO] - agents played in iteration 191 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:57:29,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:57:29,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:57:29,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:57:29,887][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:57:29,888][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:57:29,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:57:30,531][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:57:31,015][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:57:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:57:32,028][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:57:32,537][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:57:33,043][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:57:33,548][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:57:34,051][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:57:34,554][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:57:35,063][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:57:35,570][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:57:36,072][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:57:36,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:57:37,077][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:57:37,581][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:57:38,086][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:57:38,591][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:57:39,097][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:57:39,602][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:57:40,104][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:57:40,613][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:57:41,116][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:57:41,635][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:57:42,139][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:57:42,643][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:57:43,148][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:57:43,653][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:57:44,156][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:57:44,657][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:57:45,159][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:57:45,662][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:57:46,164][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:57:46,665][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:57:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:57:47,668][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:57:48,173][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:57:48,677][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:57:49,183][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:57:49,697][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:57:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:57:50,715][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:57:51,226][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:57:51,731][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:57:52,262][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:57:52,770][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:57:53,274][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:57:53,778][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:57:54,281][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:57:54,790][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:57:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:57:55,796][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:57:56,306][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:57:56,818][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:57:57,338][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:57:57,852][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:57:58,367][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:57:58,880][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:57:59,385][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:57:59,889][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:58:00,392][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:58:00,895][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:58:01,411][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:58:01,914][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:58:02,417][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:58:02,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10873 tokens. [2025-11-13 00:58:03,573][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.29%, Current % of VRAM taken: 58.54%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 00:58:04,354][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:58:04,356][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:58:04,358][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:58:05,311][__main__][INFO] - Iteration 192 took 52s (30.86% Gen, 67.32% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 55m 47s. Estimated total time: 43h 44m 11s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 28s, 500 more iterations: 7h 17m 21s. [2025-11-13 00:58:05,313][__main__][INFO] - Starting iteration 192. [2025-11-13 00:58:05,792][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 00:58:05,792][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:58:11,239][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:58:20,625][__main__][INFO] - Number of regex retries in iteration 192: 1 [2025-11-13 00:58:20,626][__main__][INFO] - agents played in iteration 192 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:58:21,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:58:21,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:58:21,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:58:21,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:58:21,516][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:58:21,517][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:58:22,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:58:22,650][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:58:23,158][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:58:23,663][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:58:24,169][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:58:24,676][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:58:25,185][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:58:25,689][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:58:26,194][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:58:26,699][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:58:27,203][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:58:27,721][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:58:28,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:58:28,738][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:58:29,245][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:58:29,749][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:58:30,254][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:58:30,757][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:58:31,266][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:58:31,771][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:58:32,276][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:58:32,784][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:58:33,289][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:58:33,794][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:58:34,296][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:58:34,801][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:58:35,304][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:58:35,809][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:58:36,313][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:58:36,825][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:58:37,329][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:58:37,840][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:58:38,341][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:58:38,842][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:58:39,344][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:58:39,846][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:58:40,351][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:58:40,854][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:58:41,357][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:58:41,861][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:58:42,363][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:58:42,867][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:58:43,369][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:58:43,872][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:58:44,376][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:58:44,878][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:58:45,381][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:58:45,884][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:58:46,386][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:58:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:58:47,396][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:58:47,898][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:58:48,402][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:58:48,908][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:58:49,412][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:58:49,915][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:58:50,418][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:58:50,939][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:58:51,445][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:58:51,953][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:58:52,454][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:58:52,957][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:58:53,461][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:58:53,962][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:58:54,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10867 tokens. [2025-11-13 00:58:55,124][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:32 [2025-11-13 00:58:55,914][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:58:55,916][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:58:55,918][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:58:56,849][__main__][INFO] - Iteration 193 took 51s (29.05% Gen, 69.12% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 43m 39s. Estimated total time: 42h 32m 55s. Time estimates for 10 more iterations: 8m 30s, 100 more iterations: 1h 25m 5s, 500 more iterations: 7h 5m 29s. [2025-11-13 00:58:56,851][__main__][INFO] - Starting iteration 193. [2025-11-13 00:58:57,320][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 00:58:57,321][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:59:08,018][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 00:59:11,848][__main__][INFO] - Number of regex retries in iteration 193: 1 [2025-11-13 00:59:11,849][__main__][INFO] - agents played in iteration 193 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 00:59:12,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:59:12,696][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:59:12,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:59:12,743][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 00:59:12,743][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 00:59:12,744][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 00:59:13,418][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 00:59:13,880][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 00:59:14,390][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 00:59:14,896][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 00:59:15,398][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 00:59:15,902][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 00:59:16,407][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 00:59:16,911][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 00:59:17,425][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 00:59:17,934][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 00:59:18,447][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 00:59:18,954][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 00:59:19,463][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 00:59:19,970][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 00:59:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 00:59:20,978][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 00:59:21,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 00:59:21,983][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 00:59:22,485][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 00:59:22,985][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 00:59:23,486][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
00:59:23,990][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 00:59:24,494][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 00:59:25,005][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 00:59:25,521][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 00:59:26,030][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 00:59:26,541][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 00:59:27,046][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 00:59:27,550][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 00:59:28,061][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 00:59:28,566][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 00:59:29,073][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 00:59:29,579][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 00:59:30,085][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 00:59:30,593][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 00:59:31,098][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 00:59:31,601][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 00:59:32,106][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 00:59:32,610][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 00:59:33,125][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 00:59:33,628][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 00:59:34,132][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
00:59:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 00:59:35,142][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 00:59:35,651][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 00:59:36,155][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 00:59:36,658][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 00:59:37,171][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 00:59:37,675][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 00:59:38,179][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 00:59:38,685][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 00:59:39,190][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 00:59:39,694][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 00:59:40,198][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 00:59:40,704][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 00:59:41,211][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 00:59:41,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 00:59:42,218][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 00:59:42,721][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 00:59:43,226][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 00:59:43,751][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 00:59:44,254][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 00:59:44,757][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
00:59:45,261][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 00:59:45,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10857 tokens. [2025-11-13 00:59:46,524][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 00:59:47,298][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 00:59:47,299][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 00:59:47,301][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 00:59:48,315][__main__][INFO] - Iteration 194 took 50s (28.49% Gen, 69.52% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 39m 39s. Estimated total time: 42h 29m 46s. Time estimates for 10 more iterations: 8m 29s, 100 more iterations: 1h 24m 59s, 500 more iterations: 7h 4m 57s. [2025-11-13 00:59:48,317][__main__][INFO] - Starting iteration 194. [2025-11-13 00:59:48,808][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 00:59:48,808][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 00:59:52,960][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:00:04,713][__main__][INFO] - Number of regex retries in iteration 194: 1 [2025-11-13 01:00:04,713][__main__][INFO] - agents played in iteration 194 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:00:05,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:00:05,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:00:05,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:00:05,596][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:00:05,596][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:00:05,598][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:00:06,290][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:00:06,747][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:00:07,268][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:00:07,773][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:00:08,275][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:00:08,781][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:00:09,283][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:00:09,784][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:00:10,284][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:00:10,796][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:00:11,306][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:00:11,808][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:00:12,313][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:00:12,817][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:00:13,318][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:00:13,822][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:00:14,324][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:00:14,826][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:00:15,338][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:00:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:00:16,340][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:00:16,851][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:00:17,353][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:00:17,867][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:00:18,369][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:00:18,870][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:00:19,377][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:00:19,882][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:00:20,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:00:20,898][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:00:21,400][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:00:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:00:22,409][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:00:22,915][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:00:23,422][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:00:23,928][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:00:24,437][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:00:24,941][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:00:25,443][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:00:25,963][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:00:26,466][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:00:26,969][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:00:27,469][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:00:27,972][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:00:28,482][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:00:28,985][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:00:29,490][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:00:29,991][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:00:30,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:00:31,000][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:00:31,503][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:00:32,001][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:00:32,506][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:00:33,007][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:00:33,511][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:00:34,015][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:00:34,515][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:00:35,018][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:00:35,519][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:00:36,019][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:00:36,521][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:00:37,023][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:00:37,527][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:00:38,033][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:00:38,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10831 tokens. [2025-11-13 01:00:39,295][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 01:00:40,073][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:00:40,075][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:00:40,076][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:00:41,076][__main__][INFO] - Iteration 195 took 52s (30.43% Gen, 67.65% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 42m 28s. Estimated total time: 43h 33m 27s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 6s, 500 more iterations: 7h 15m 34s. [2025-11-13 01:00:41,078][__main__][INFO] - Starting iteration 195. [2025-11-13 01:00:41,587][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 01:00:41,587][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:00:45,745][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:00:50,051][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:00:57,609][__main__][INFO] - Number of regex retries in iteration 195: 2 [2025-11-13 01:00:57,609][__main__][INFO] - agents played in iteration 195 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:00:58,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:00:58,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:00:58,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:00:58,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:00:58,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:00:58,515][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:00:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:00:59,607][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:01:00,115][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:01:00,624][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:01:01,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:01:01,644][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:01:02,153][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:01:02,671][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:01:03,175][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:01:03,677][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:01:04,180][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:01:04,681][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:01:05,188][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:01:05,690][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:01:06,192][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:01:06,696][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:01:07,197][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:01:07,698][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:01:08,202][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:01:08,704][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:01:09,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:01:09,709][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:01:10,213][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:01:10,714][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:01:11,216][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:01:11,718][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:01:12,221][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:01:12,723][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:01:13,233][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:01:13,739][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:01:14,244][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:01:14,759][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:01:15,263][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:01:15,776][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:01:16,282][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:01:16,786][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:01:17,292][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:01:17,796][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:01:18,299][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:01:18,800][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:01:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:01:19,807][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:01:20,310][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:01:20,813][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:01:21,320][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:01:21,822][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:01:22,326][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:01:22,828][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:01:23,332][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:01:23,837][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:01:24,340][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:01:24,844][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:01:25,350][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:01:25,855][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:01:26,382][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:01:26,884][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:01:27,386][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:01:27,890][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:01:28,396][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:01:28,902][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:01:29,407][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:01:29,915][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:01:30,420][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:01:30,923][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:01:31,428][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10861 tokens. [2025-11-13 01:01:32,151][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 01:01:32,919][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:01:32,921][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:01:32,923][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:01:33,816][__main__][INFO] - Iteration 196 took 52s (30.68% Gen, 67.61% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 39m 36s. Estimated total time: 43h 31m 28s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 14s. [2025-11-13 01:01:33,818][__main__][INFO] - Starting iteration 196. [2025-11-13 01:01:34,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 01:01:34,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:01:50,855][__main__][INFO] - Number of regex retries in iteration 196: 0 [2025-11-13 01:01:50,856][__main__][INFO] - agents played in iteration 196 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:01:51,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:01:51,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:01:51,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:01:51,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:01:51,845][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:01:51,846][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:01:52,513][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:01:52,983][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:01:53,494][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:01:53,998][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:01:54,504][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:01:55,014][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:01:55,518][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:01:56,021][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:01:56,522][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:01:57,025][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:01:57,530][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:01:58,036][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:01:58,537][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:01:59,039][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:01:59,547][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:02:00,046][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:02:00,543][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:02:01,044][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:02:01,545][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:02:02,045][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:02:02,546][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:02:03,052][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:02:03,554][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:02:04,055][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:02:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:02:05,063][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:02:05,567][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:02:06,075][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:02:06,578][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:02:07,082][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:02:07,589][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:02:08,093][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:02:08,597][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:02:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:02:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:02:10,116][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:02:10,621][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:02:11,122][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:02:11,628][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:02:12,136][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:02:12,639][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:02:13,141][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:02:13,644][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:02:14,146][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:02:14,650][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:02:15,153][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:02:15,656][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:02:16,159][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:02:16,661][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:02:17,163][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:02:17,666][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:02:18,183][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:02:18,684][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:02:19,187][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:02:19,694][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:02:20,199][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:02:20,708][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:02:21,215][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:02:21,720][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:02:22,225][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:02:22,731][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:02:23,235][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:02:23,739][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:02:24,245][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:02:24,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10856 tokens. [2025-11-13 01:02:25,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:32 [2025-11-13 01:02:26,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:02:26,273][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:02:26,275][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:02:27,208][__main__][INFO] - Iteration 197 took 52s (31.30% Gen, 66.93% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 13m 2s. Estimated total time: 44h 5m 47s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 11s, 500 more iterations: 7h 20m 57s. [2025-11-13 01:02:27,210][__main__][INFO] - Starting iteration 197. [2025-11-13 01:02:27,685][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 01:02:27,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:02:42,610][__main__][INFO] - Number of regex retries in iteration 197: 0 [2025-11-13 01:02:42,610][__main__][INFO] - agents played in iteration 197 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:02:43,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:02:43,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:02:43,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:02:43,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:02:43,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:02:43,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:02:44,110][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:02:44,581][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:02:45,086][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:02:45,605][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:02:46,107][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:02:46,608][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:02:47,114][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:02:47,615][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:02:48,118][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:02:48,617][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:02:49,119][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:02:49,623][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:02:50,124][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:02:50,623][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:02:51,122][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:02:51,623][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:02:52,127][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:02:52,634][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:02:53,137][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:02:53,641][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:02:54,144][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:02:54,646][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:02:55,147][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:02:55,651][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:02:56,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:02:56,659][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:02:57,162][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:02:57,667][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:02:58,176][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:02:58,688][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:02:59,194][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:02:59,700][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:03:00,208][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:03:00,712][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:03:01,221][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:03:01,723][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:03:02,228][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:03:02,739][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:03:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:03:03,748][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:03:04,252][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:03:04,755][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:03:05,256][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:03:05,758][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:03:06,261][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:03:06,765][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:03:07,267][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:03:07,769][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:03:08,283][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:03:08,787][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:03:09,302][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:03:09,806][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:03:10,312][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:03:10,820][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:03:11,325][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:03:11,830][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:03:12,338][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:03:12,841][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:03:13,348][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:03:13,856][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:03:14,360][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:03:14,864][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:03:15,369][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:03:15,886][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:03:16,394][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10857 tokens. [2025-11-13 01:03:17,134][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.40%, ΔTime: 00:00:33 [2025-11-13 01:03:17,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:03:17,909][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:03:17,911][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:03:18,788][__main__][INFO] - Iteration 198 took 51s (29.20% Gen, 69.08% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 41m 33s. Estimated total time: 42h 35m 10s. Time estimates for 10 more iterations: 8m 31s, 100 more iterations: 1h 25m 10s, 500 more iterations: 7h 5m 51s. [2025-11-13 01:03:18,790][__main__][INFO] - Starting iteration 198. [2025-11-13 01:03:19,300][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 01:03:19,300][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:03:34,108][__main__][INFO] - Number of regex retries in iteration 198: 0 [2025-11-13 01:03:34,109][__main__][INFO] - agents played in iteration 198 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:03:34,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:03:34,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:03:35,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:03:35,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:03:35,040][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:03:35,041][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:03:35,781][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:03:36,241][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:03:36,750][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:03:37,258][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:03:37,760][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:03:38,264][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:03:38,766][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:03:39,268][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:03:39,771][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:03:40,274][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:03:40,774][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:03:41,283][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:03:41,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:03:42,282][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:03:42,781][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:03:43,280][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:03:43,781][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:03:44,285][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:03:44,786][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:03:45,287][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:03:45,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:03:46,292][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:03:46,793][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:03:47,294][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:03:47,804][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:03:48,302][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:03:48,813][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:03:49,314][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:03:49,813][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:03:50,313][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:03:50,814][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:03:51,313][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:03:51,818][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:03:52,324][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:03:52,836][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:03:53,340][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:03:53,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:03:54,349][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:03:54,857][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:03:55,364][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:03:55,870][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:03:56,374][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:03:56,881][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:03:57,386][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:03:57,895][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:03:58,399][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:03:58,901][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:03:59,405][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:03:59,909][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:04:00,411][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:04:00,914][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:04:01,416][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:04:01,933][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:04:02,435][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:04:02,937][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:04:03,438][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:04:03,942][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:04:04,452][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:04:04,957][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:04:05,463][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:04:05,971][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:04:06,478][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:04:06,984][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:04:07,485][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:04:07,993][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10865 tokens. [2025-11-13 01:04:08,738][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.42%, ΔTime: 00:00:32 [2025-11-13 01:04:09,485][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:04:09,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:04:09,488][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:04:10,398][__main__][INFO] - Iteration 199 took 51s (28.98% Gen, 69.24% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 40m 26s. Estimated total time: 42h 34m 55s. Time estimates for 10 more iterations: 8m 30s, 100 more iterations: 1h 25m 9s, 500 more iterations: 7h 5m 49s. [2025-11-13 01:04:10,400][__main__][INFO] - Starting iteration 199. [2025-11-13 01:04:10,884][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 01:04:10,885][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:04:16,857][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:04:22,450][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:04:27,519][__main__][INFO] - Number of regex retries in iteration 199: 2 [2025-11-13 01:04:27,520][__main__][INFO] - agents played in iteration 199 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:04:28,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:04:28,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:04:28,447][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:04:28,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:04:28,470][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:04:28,471][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:04:29,105][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:04:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:04:30,077][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:04:30,579][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:04:31,084][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:04:31,588][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:04:32,093][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:04:32,597][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:04:33,101][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:04:33,603][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:04:34,106][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:04:34,612][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:04:35,118][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:04:35,621][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:04:36,127][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:04:36,631][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:04:37,135][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:04:37,637][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:04:38,139][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:04:38,660][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:04:39,163][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:04:39,666][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:04:40,172][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:04:40,675][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:04:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:04:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:04:42,178][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:04:42,679][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:04:43,179][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:04:43,681][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:04:44,181][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:04:44,683][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:04:45,191][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:04:45,697][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:04:46,202][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:04:46,710][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:04:47,218][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:04:47,723][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:04:48,238][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:04:48,742][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:04:49,247][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:04:49,752][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:04:50,258][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:04:50,766][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:04:51,271][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:04:51,774][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:04:52,279][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:04:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:04:53,291][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:04:53,794][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:04:54,297][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:04:54,800][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:04:55,304][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:04:55,810][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:04:56,314][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:04:56,820][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:04:57,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:04:57,844][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:04:58,357][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:04:58,863][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:04:59,370][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:04:59,877][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:05:00,384][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:05:00,890][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:05:01,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10862 tokens. [2025-11-13 01:05:02,138][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 01:05:02,863][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:05:02,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:05:02,867][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:05:03,770][__main__][INFO] - Iteration 200 took 52s (31.45% Gen, 66.84% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 8m 57s. Estimated total time: 44h 4m 19s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 8s, 500 more iterations: 7h 20m 43s. [2025-11-13 01:05:03,772][__main__][INFO] - Starting iteration 200. [2025-11-13 01:05:04,242][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 19 and human policies 1. 
[2025-11-13 01:05:04,242][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:05:20,002][__main__][INFO] - Number of regex retries in iteration 200: 0 [2025-11-13 01:05:20,003][__main__][INFO] - agents played in iteration 200 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:05:20,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:05:20,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:05:20,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:05:20,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:05:20,871][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:05:20,873][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:05:21,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:05:21,974][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:05:22,481][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:05:22,986][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:05:23,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:05:23,992][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:05:24,496][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:05:24,999][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:05:25,502][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:05:26,004][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:05:26,507][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:05:27,007][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:05:27,506][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:05:28,009][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:05:28,516][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:05:29,019][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:05:29,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:05:30,022][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:05:30,525][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:05:31,026][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:05:31,529][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:05:32,031][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:05:32,545][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:05:33,048][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:05:33,551][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:05:34,056][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:05:34,559][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:05:35,068][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:05:35,569][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:05:36,069][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:05:36,577][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:05:37,076][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:05:37,576][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:05:38,080][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:05:38,585][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:05:39,092][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:05:39,599][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:05:40,104][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:05:40,612][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:05:41,116][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:05:41,621][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:05:42,125][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:05:42,636][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:05:43,144][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:05:43,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:05:44,156][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:05:44,664][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:05:45,169][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:05:45,679][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:05:46,184][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:05:46,687][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:05:47,193][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:05:47,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:05:48,202][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:05:48,708][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:05:49,217][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:05:49,724][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:05:50,230][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:05:50,736][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:05:51,260][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:05:51,766][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:05:52,274][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:05:52,780][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:05:53,286][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:05:53,793][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10866 tokens. [2025-11-13 01:05:54,543][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 01:05:55,283][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:05:55,285][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:05:55,287][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:05:57,065][__main__][INFO] - Iteration 201 took 52s (29.84% Gen, 66.80% Train). Generation: 15s, Training: 35s. Estimated remaining time: 41h 4m 55s. Estimated total time: 44h 1m 11s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 2s, 500 more iterations: 7h 20m 11s. [2025-11-13 01:05:57,067][__main__][INFO] - Starting iteration 201. [2025-11-13 01:05:57,548][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 01:05:57,548][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:06:12,784][mllm.models.large_language_model_local][WARNING] - Response Proposal: 9 hats, 0 books, 11 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:06:13,801][__main__][INFO] - Number of regex retries in iteration 201: 1 [2025-11-13 01:06:13,801][__main__][INFO] - agents played in iteration 201 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:06:14,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:06:14,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:06:14,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:06:14,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:06:14,693][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:06:14,695][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:06:15,320][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:06:15,784][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:06:16,293][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:06:16,795][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:06:17,300][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:06:17,803][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:06:18,306][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:06:18,813][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:06:19,317][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:06:19,819][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:06:20,319][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:06:20,823][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:06:21,328][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:06:21,832][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:06:22,334][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:06:22,839][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:06:23,340][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:06:23,841][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:06:24,342][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:06:24,845][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:06:25,349][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:06:25,851][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:06:26,354][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:06:26,855][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:06:27,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:06:27,874][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:06:28,380][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:06:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:06:29,401][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:06:29,904][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:06:30,406][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:06:30,908][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:06:31,412][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:06:31,915][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:06:32,418][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:06:32,919][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:06:33,421][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:06:33,925][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:06:34,433][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:06:34,939][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:06:35,444][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:06:35,961][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:06:36,462][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:06:36,978][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:06:37,482][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:06:37,986][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:06:38,495][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:06:39,000][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:06:39,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:06:40,007][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:06:40,515][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:06:41,027][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:06:41,533][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:06:42,040][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:06:42,555][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:06:43,063][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:06:43,579][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:06:44,084][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:06:44,590][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:06:45,102][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:06:45,607][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:06:46,113][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:06:46,619][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:06:47,123][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:06:47,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10823 tokens. [2025-11-13 01:06:48,341][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.40%, ΔTime: 00:00:33 [2025-11-13 01:06:49,114][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:06:49,116][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:06:49,117][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:06:50,181][__main__][INFO] - Iteration 202 took 52s (30.88% Gen, 67.10% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 54m 35s. Estimated total time: 43h 51m 43s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 43s, 500 more iterations: 7h 18m 37s. [2025-11-13 01:06:50,184][__main__][INFO] - Starting iteration 202. [2025-11-13 01:06:50,678][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 01:06:50,679][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:07:06,153][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:07:06,941][__main__][INFO] - Number of regex retries in iteration 202: 1 [2025-11-13 01:07:06,942][__main__][INFO] - agents played in iteration 202 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:07:07,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:07:07,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:07:07,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:07:07,863][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:07:07,864][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:07:07,865][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:07:08,495][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:07:08,956][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:07:09,461][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:07:09,977][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:07:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:07:10,983][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:07:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:07:11,988][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:07:12,490][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:07:12,993][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:07:13,495][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:07:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:07:14,510][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:07:15,012][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:07:15,513][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:07:16,015][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:07:16,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:07:17,020][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:07:17,526][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:07:18,032][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:07:18,535][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:07:19,038][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:07:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:07:20,041][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:07:20,544][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:07:21,066][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:07:21,569][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:07:22,079][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:07:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:07:23,091][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:07:23,595][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:07:24,097][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:07:24,610][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:07:25,111][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:07:25,613][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:07:26,114][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:07:26,617][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:07:27,121][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:07:27,622][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:07:28,130][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:07:28,637][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:07:29,139][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:07:29,643][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:07:30,147][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:07:30,655][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:07:31,159][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:07:31,664][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:07:32,171][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:07:32,685][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:07:33,191][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:07:33,703][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:07:34,210][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:07:34,713][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:07:35,220][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:07:35,724][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:07:36,228][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:07:36,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:07:37,242][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:07:37,749][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:07:38,256][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:07:38,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:07:39,264][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:07:39,769][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:07:40,272][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:07:40,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10872 tokens. [2025-11-13 01:07:41,494][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 01:07:42,249][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:07:42,251][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:07:42,253][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:07:43,185][__main__][INFO] - Iteration 203 took 52s (30.97% Gen, 67.25% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 47m 22s. Estimated total time: 43h 45m 23s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 30s, 500 more iterations: 7h 17m 33s. [2025-11-13 01:07:43,188][__main__][INFO] - Starting iteration 203. [2025-11-13 01:07:43,658][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 01:07:43,659][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:07:59,946][__main__][INFO] - Number of regex retries in iteration 203: 0 [2025-11-13 01:07:59,947][__main__][INFO] - agents played in iteration 203 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:08:00,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:08:00,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:08:00,802][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:08:00,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:08:00,825][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:08:00,826][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:08:01,455][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:08:01,913][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:08:02,418][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:08:02,917][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:08:03,419][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:08:03,928][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:08:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:08:04,936][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:08:05,439][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:08:05,938][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:08:06,440][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:08:06,937][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:08:07,438][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:08:07,942][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:08:08,447][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:08:08,949][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:08:09,451][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:08:09,953][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:08:10,456][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:08:10,956][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:08:11,458][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:08:11,963][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:08:12,464][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:08:12,965][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:08:13,467][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:08:13,971][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:08:14,472][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:08:14,974][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:08:15,477][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:08:15,982][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:08:16,484][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:08:16,986][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:08:17,496][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:08:17,996][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:08:18,508][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:08:19,011][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:08:19,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:08:20,029][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:08:20,531][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:08:21,038][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:08:21,541][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:08:22,048][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:08:22,555][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:08:23,061][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:08:23,567][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:08:24,073][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:08:24,578][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:08:25,084][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:08:25,588][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:08:26,093][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:08:26,615][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:08:27,119][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:08:27,622][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:08:28,126][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:08:28,631][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:08:29,141][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:08:29,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:08:30,148][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:08:30,654][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:08:31,157][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:08:31,661][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:08:32,166][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:08:32,672][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:08:33,179][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:08:33,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10811 tokens. [2025-11-13 01:08:34,380][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:32 [2025-11-13 01:08:35,130][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:08:35,131][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:08:35,133][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:08:36,106][__main__][INFO] - Iteration 204 took 52s (31.05% Gen, 67.09% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 43m 33s. Estimated total time: 43h 42m 27s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 24s, 500 more iterations: 7h 17m 4s. [2025-11-13 01:08:36,108][__main__][INFO] - Starting iteration 204. [2025-11-13 01:08:36,590][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 01:08:36,591][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:08:46,358][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:08:53,373][__main__][INFO] - Number of regex retries in iteration 204: 1 [2025-11-13 01:08:53,374][__main__][INFO] - agents played in iteration 204 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:08:54,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:08:54,199][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:08:54,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:08:54,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:08:54,248][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:08:54,249][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:08:54,869][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:08:55,331][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:08:56,011][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:08:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:08:57,022][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:08:57,523][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:08:58,032][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:08:58,534][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:08:59,037][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:08:59,539][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:09:00,042][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:09:00,546][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:09:01,050][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:09:01,553][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:09:02,055][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:09:02,558][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:09:03,060][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:09:03,561][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:09:04,060][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:09:04,562][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:09:05,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:09:05,571][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:09:06,077][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:09:06,579][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:09:07,081][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:09:07,581][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:09:08,082][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:09:08,582][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:09:09,085][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:09:09,585][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:09:10,086][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:09:10,591][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:09:11,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:09:11,599][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:09:12,101][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:09:12,609][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:09:13,111][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:09:13,618][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:09:14,129][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:09:14,639][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:09:15,153][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:09:15,658][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:09:16,166][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:09:16,676][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:09:17,184][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:09:17,694][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:09:18,199][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:09:18,706][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:09:19,225][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:09:19,731][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:09:20,236][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:09:20,741][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:09:21,246][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:09:21,755][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:09:22,261][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:09:22,764][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:09:23,268][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:09:23,772][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:09:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:09:24,783][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:09:25,287][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:09:25,793][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:09:26,298][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:09:26,802][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:09:27,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10823 tokens. [2025-11-13 01:09:28,011][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 01:09:28,782][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:09:28,785][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:09:28,787][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:09:29,781][__main__][INFO] - Iteration 205 took 53s (31.55% Gen, 66.58% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 19m 44s. Estimated total time: 44h 19m 32s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 39s, 500 more iterations: 7h 23m 15s. [2025-11-13 01:09:29,783][__main__][INFO] - Starting iteration 205. [2025-11-13 01:09:30,266][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 01:09:30,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:09:34,625][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:09:34,727][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:09:41,750][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:09:46,720][__main__][INFO] - Number of regex retries in iteration 205: 3 [2025-11-13 01:09:46,721][__main__][INFO] - agents played in iteration 205 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:09:47,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:09:47,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:09:47,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:09:47,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:09:47,710][mllm.training.trainer_ad_align][INFO] - 
Sharing advantage alignment data. [2025-11-13 01:09:47,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 01:09:48,352][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:09:48,809][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:09:49,315][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:09:49,819][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:09:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:09:50,838][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:09:51,342][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:09:51,844][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:09:52,346][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:09:52,851][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:09:53,360][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:09:53,863][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:09:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:09:54,873][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:09:55,376][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:09:55,879][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:09:56,381][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:09:56,882][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:09:57,385][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:09:57,887][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-13 01:09:58,385][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 01:09:58,893][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:09:59,396][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:09:59,897][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:10:00,397][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:10:00,899][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:10:01,404][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:10:01,908][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:10:02,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:10:02,922][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:10:03,424][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:10:03,936][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:10:04,439][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:10:04,944][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:10:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:10:05,978][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:10:06,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:10:07,002][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:10:07,509][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:10:08,029][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:10:08,537][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-13 01:10:09,048][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 01:10:09,554][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:10:10,061][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:10:10,569][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:10:11,076][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:10:11,580][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:10:12,086][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:10:12,591][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:10:13,097][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:10:13,603][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:10:14,110][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:10:14,626][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:10:15,134][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:10:15,645][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:10:16,155][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:10:16,659][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:10:17,165][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:10:17,673][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:10:18,182][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:10:18,691][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:10:19,200][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-13 01:10:19,717][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 01:10:20,227][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:10:20,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10855 tokens. [2025-11-13 01:10:21,443][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 01:10:22,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:10:22,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:10:22,246][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:10:23,207][__main__][INFO] - Iteration 206 took 52s (31.08% Gen, 67.10% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 6m 25s. Estimated total time: 44h 7m 6s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 14s, 500 more iterations: 7h 21m 11s. [2025-11-13 01:10:23,209][__main__][INFO] - Starting iteration 206. [2025-11-13 01:10:23,685][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 01:10:23,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:10:38,346][__main__][INFO] - Number of regex retries in iteration 206: 0 [2025-11-13 01:10:38,346][__main__][INFO] - agents played in iteration 206 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:10:39,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:10:39,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:10:39,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:10:39,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:10:39,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:10:39,268][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:10:39,889][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:10:40,349][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:10:40,864][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:10:41,368][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:10:41,878][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:10:42,380][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:10:42,882][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:10:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:10:43,892][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:10:44,392][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:10:44,892][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:10:45,390][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:10:45,891][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:10:46,390][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:10:46,889][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:10:47,389][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:10:47,887][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:10:48,389][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:10:48,892][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:10:49,395][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:10:49,903][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:10:50,405][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:10:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:10:51,410][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:10:51,913][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:10:52,426][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:10:52,928][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:10:53,430][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:10:53,936][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:10:54,438][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:10:54,946][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:10:55,449][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:10:55,956][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:10:56,466][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:10:56,971][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:10:57,478][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:10:57,986][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:10:58,496][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:10:59,005][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:10:59,510][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:11:00,016][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:11:00,529][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:11:01,033][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:11:01,540][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:11:02,049][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:11:02,556][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:11:03,064][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:11:03,570][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:11:04,076][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:11:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:11:05,089][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:11:05,595][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:11:06,103][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:11:06,608][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:11:07,119][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:11:07,624][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:11:08,130][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:11:08,639][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:11:09,145][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:11:09,651][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:11:10,156][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:11:10,661][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:11:11,170][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:11:11,674][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:11:12,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10850 tokens. [2025-11-13 01:11:12,907][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 01:11:13,693][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:11:13,695][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:11:13,697][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:11:14,617][__main__][INFO] - Iteration 207 took 50s (28.78% Gen, 69.41% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 25m 5s. Estimated total time: 42h 26m 38s. Time estimates for 10 more iterations: 8m 29s, 100 more iterations: 1h 24m 53s, 500 more iterations: 7h 4m 26s. [2025-11-13 01:11:14,619][__main__][INFO] - Starting iteration 207. [2025-11-13 01:11:15,102][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 01:11:15,102][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:11:21,848][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:11:27,682][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:11:30,765][__main__][INFO] - Number of regex retries in iteration 207: 2 [2025-11-13 01:11:30,766][__main__][INFO] - agents played in iteration 207 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:11:31,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:11:31,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:11:31,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:11:31,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:11:31,611][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:11:31,612][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:11:32,253][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:11:32,713][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:11:33,222][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:11:33,724][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:11:34,225][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:11:34,737][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:11:35,239][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:11:35,858][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:11:36,373][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:11:36,878][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:11:37,383][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:11:37,885][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:11:38,388][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:11:38,892][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:11:39,394][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:11:39,897][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:11:40,399][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:11:40,900][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:11:41,405][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:11:41,906][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:11:42,408][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:11:42,911][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:11:43,413][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:11:43,940][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:11:44,441][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:11:44,940][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:11:45,442][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:11:45,941][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:11:46,459][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:11:46,963][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:11:47,469][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:11:47,976][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:11:48,482][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:11:48,989][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:11:49,496][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:11:50,002][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:11:50,513][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:11:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:11:51,536][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:11:52,042][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:11:52,549][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:11:53,078][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:11:53,587][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:11:54,095][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:11:54,602][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:11:55,110][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:11:55,629][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:11:56,137][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:11:56,649][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:11:57,156][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:11:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:11:58,170][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:11:58,682][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:11:59,189][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:11:59,697][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:12:00,203][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:12:00,710][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:12:01,216][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:12:01,721][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:12:02,234][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:12:02,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:12:03,244][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:12:03,750][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:12:04,254][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:12:04,760][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10840 tokens. [2025-11-13 01:12:05,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 01:12:06,212][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:12:06,214][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:12:06,216][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:12:07,220][__main__][INFO] - Iteration 208 took 52s (30.05% Gen, 68.02% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 23m 29s. Estimated total time: 43h 25m 55s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 51s, 500 more iterations: 7h 14m 19s. [2025-11-13 01:12:07,222][__main__][INFO] - Starting iteration 208. [2025-11-13 01:12:07,703][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 01:12:07,704][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:12:17,464][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:12:23,768][__main__][INFO] - Number of regex retries in iteration 208: 1 [2025-11-13 01:12:23,769][__main__][INFO] - agents played in iteration 208 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:12:24,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:12:24,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:12:24,663][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:12:24,686][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:12:24,687][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:12:24,687][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:12:25,307][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:12:25,775][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:12:26,280][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:12:26,783][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:12:27,286][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:12:27,787][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:12:28,288][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:12:28,791][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:12:29,293][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:12:29,798][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:12:30,300][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:12:30,801][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:12:31,303][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:12:31,802][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:12:32,306][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:12:32,807][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:12:33,308][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:12:33,809][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:12:34,312][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:12:34,812][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:12:35,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:12:35,816][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:12:36,321][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:12:36,822][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:12:37,321][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:12:37,822][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:12:38,322][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:12:38,824][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:12:39,325][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:12:39,826][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:12:40,335][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:12:40,840][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:12:41,350][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:12:41,856][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:12:42,362][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:12:42,873][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:12:43,381][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:12:43,891][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:12:44,395][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:12:44,901][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:12:45,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:12:45,916][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:12:46,421][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:12:46,927][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:12:47,430][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:12:47,935][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:12:48,440][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:12:48,950][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:12:49,461][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:12:49,964][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:12:50,468][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:12:50,976][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:12:51,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:12:51,995][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:12:52,500][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:12:53,005][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:12:53,508][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:12:54,013][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:12:54,520][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:12:55,026][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:12:55,533][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:12:56,041][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:12:56,546][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:12:57,053][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:12:57,559][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10858 tokens. [2025-11-13 01:12:58,314][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 01:12:59,068][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:12:59,070][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:12:59,071][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:13:00,023][__main__][INFO] - Iteration 209 took 52s (30.70% Gen, 67.47% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 32m 42s. Estimated total time: 43h 36m 0s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 12s, 500 more iterations: 7h 16m 0s. [2025-11-13 01:13:00,025][__main__][INFO] - Starting iteration 209. [2025-11-13 01:13:00,513][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 01:13:00,514][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:13:14,007][__main__][INFO] - Number of regex retries in iteration 209: 0 [2025-11-13 01:13:14,007][__main__][INFO] - agents played in iteration 209 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:13:14,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:13:14,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:13:14,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:13:15,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:13:15,009][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:13:15,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:13:15,636][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:13:16,093][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:13:16,598][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:13:17,105][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:13:17,611][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:13:18,114][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:13:18,614][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:13:19,114][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:13:19,617][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:13:20,121][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:13:20,622][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:13:21,121][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:13:21,622][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:13:22,123][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:13:22,625][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:13:23,127][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:13:23,631][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:13:24,136][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:13:24,641][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:13:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:13:25,645][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:13:26,169][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:13:26,673][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:13:27,175][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:13:27,695][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:13:28,197][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:13:28,698][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:13:29,200][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:13:29,702][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:13:30,205][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:13:30,706][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:13:31,209][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:13:31,715][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:13:32,218][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:13:32,722][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:13:33,226][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:13:33,735][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:13:34,240][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:13:34,746][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:13:35,253][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:13:35,768][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:13:36,278][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:13:36,792][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:13:37,297][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:13:37,799][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:13:38,308][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:13:38,813][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:13:39,318][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:13:39,823][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:13:40,326][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:13:40,829][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:13:41,333][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:13:41,840][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:13:42,347][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:13:42,853][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:13:43,358][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:13:43,862][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:13:44,366][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:13:44,884][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:13:45,389][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:13:45,892][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:13:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:13:46,901][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:13:47,408][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:13:47,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10846 tokens. [2025-11-13 01:13:48,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 01:13:49,366][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:13:49,368][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:13:49,369][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:13:50,277][__main__][INFO] - Iteration 210 took 49s (27.12% Gen, 71.06% Train). Generation: 13s, Training: 35s. Estimated remaining time: 38h 24m 4s. Estimated total time: 41h 28m 13s. Time estimates for 10 more iterations: 8m 17s, 100 more iterations: 1h 22m 56s, 500 more iterations: 6h 54m 42s. [2025-11-13 01:13:50,279][__main__][INFO] - Starting iteration 210. [2025-11-13 01:13:50,756][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 20 and human policies 1. 
[2025-11-13 01:13:50,757][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:13:55,621][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:14:06,019][__main__][INFO] - Number of regex retries in iteration 210: 1 [2025-11-13 01:14:06,019][__main__][INFO] - agents played in iteration 210 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:14:06,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:14:06,825][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:14:06,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:14:06,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:14:06,872][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:14:06,873][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:14:07,496][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:14:07,954][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:14:08,472][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:14:08,974][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:14:09,477][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:14:09,978][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:14:10,481][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:14:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:14:11,504][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:14:12,009][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:14:12,511][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:14:13,016][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:14:13,521][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:14:14,023][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:14:14,526][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:14:15,031][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:14:15,534][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:14:16,036][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:14:16,537][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:14:17,038][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:14:17,540][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:14:18,041][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:14:18,543][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:14:19,043][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:14:19,546][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:14:20,054][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:14:20,557][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:14:21,061][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:14:21,575][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:14:22,078][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:14:22,594][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:14:23,096][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:14:23,601][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:14:24,106][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:14:24,612][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:14:25,120][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:14:25,624][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:14:26,132][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:14:26,641][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:14:27,145][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:14:27,650][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:14:28,160][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:14:28,664][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:14:29,182][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:14:29,688][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:14:30,193][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:14:30,703][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:14:31,211][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:14:31,718][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:14:32,223][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:14:32,727][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:14:33,237][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:14:33,741][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:14:34,244][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:14:34,748][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:14:35,252][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:14:35,758][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:14:36,261][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:14:36,762][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:14:37,282][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:14:37,791][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:14:38,304][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:14:38,808][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:14:39,316][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:14:39,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10850 tokens. [2025-11-13 01:14:40,586][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 01:14:41,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:14:41,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:14:41,348][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:14:42,993][__main__][INFO] - Iteration 211 took 52s (29.22% Gen, 67.63% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 26m 50s. Estimated total time: 43h 31m 51s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 18s. [2025-11-13 01:14:42,995][__main__][INFO] - Starting iteration 211. [2025-11-13 01:14:43,485][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 01:14:43,486][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:15:00,397][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:15:01,208][__main__][INFO] - Number of regex retries in iteration 211: 1 [2025-11-13 01:15:01,209][__main__][INFO] - agents played in iteration 211 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:15:02,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:15:02,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:15:02,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:15:02,127][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:15:02,127][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:15:02,128][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:15:02,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:15:03,219][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:15:03,726][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:15:04,228][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:15:04,735][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:15:05,237][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:15:05,742][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:15:06,243][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:15:06,744][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:15:07,248][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:15:07,750][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:15:08,254][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:15:08,758][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:15:09,260][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:15:09,766][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:15:10,269][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:15:10,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:15:11,287][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:15:11,788][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:15:12,299][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:15:12,802][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:15:13,304][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:15:13,804][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:15:14,308][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:15:14,816][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:15:15,324][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:15:15,829][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:15:16,336][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:15:16,840][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:15:17,345][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:15:17,852][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:15:18,359][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:15:18,867][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:15:19,372][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:15:19,878][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:15:20,386][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:15:20,893][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:15:21,398][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:15:21,904][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:15:22,405][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:15:22,914][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:15:23,423][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:15:23,932][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:15:24,436][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:15:24,941][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:15:25,449][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:15:25,954][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:15:26,459][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:15:26,977][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:15:27,481][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:15:28,000][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:15:28,506][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:15:29,016][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:15:29,524][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:15:30,043][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:15:30,554][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:15:31,062][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:15:31,571][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:15:32,086][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:15:32,592][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:15:33,106][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:15:33,614][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:15:34,120][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:15:34,628][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:15:35,135][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10844 tokens. [2025-11-13 01:15:35,885][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 01:15:36,618][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:15:36,620][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:15:36,622][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:15:37,504][__main__][INFO] - Iteration 212 took 54s (32.81% Gen, 65.56% Train). Generation: 17s, Training: 35s. Estimated remaining time: 41h 55m 2s. Estimated total time: 45h 0m 58s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 1s, 500 more iterations: 7h 30m 9s. [2025-11-13 01:15:37,506][__main__][INFO] - Starting iteration 212. [2025-11-13 01:15:37,991][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 01:15:37,992][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:15:44,162][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:15:46,204][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:15:52,191][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:15:55,200][__main__][INFO] - Number of regex retries in iteration 212: 3 [2025-11-13 01:15:55,200][__main__][INFO] - agents played in iteration 212 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:15:56,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:15:56,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:15:56,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:15:56,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:15:56,298][mllm.training.trainer_ad_align][INFO] - 
Sharing advantage alignment data. [2025-11-13 01:15:56,300][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 01:15:56,930][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:15:57,394][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:15:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:15:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:15:58,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:15:59,408][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:15:59,916][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:16:00,418][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:16:00,920][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:16:01,424][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:16:01,930][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:16:02,432][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:16:02,933][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:16:03,434][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:16:03,937][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:16:04,439][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:16:04,940][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:16:05,444][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:16:05,944][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:16:06,445][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-13 01:16:06,953][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 01:16:07,461][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:16:07,971][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:16:08,477][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:16:08,988][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:16:09,495][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:16:10,002][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:16:10,513][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:16:11,019][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:16:11,524][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:16:12,034][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:16:12,539][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:16:13,044][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:16:13,553][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:16:14,059][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:16:14,564][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:16:15,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:16:15,574][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:16:16,078][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:16:16,581][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:16:17,084][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-13 01:16:17,587][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 01:16:18,088][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:16:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:16:19,107][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:16:19,609][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:16:20,113][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:16:20,618][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:16:21,125][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:16:21,632][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:16:22,137][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:16:22,642][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:16:23,151][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:16:23,660][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:16:24,163][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:16:24,668][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:16:25,176][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:16:25,683][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:16:26,192][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:16:26,701][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:16:27,210][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:16:27,727][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-13 01:16:28,234][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 01:16:28,743][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:16:29,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10873 tokens. [2025-11-13 01:16:29,958][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.29%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 01:16:30,704][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:16:30,706][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:16:30,707][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:16:31,644][__main__][INFO] - Iteration 213 took 53s (32.07% Gen, 66.18% Train). Generation: 17s, Training: 35s. Estimated remaining time: 41h 35m 50s. Estimated total time: 44h 42m 40s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 25s, 500 more iterations: 7h 27m 6s. [2025-11-13 01:16:31,646][__main__][INFO] - Starting iteration 213. [2025-11-13 01:16:32,134][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 01:16:32,135][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:16:46,711][__main__][INFO] - Number of regex retries in iteration 213: 0 [2025-11-13 01:16:46,712][__main__][INFO] - agents played in iteration 213 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:16:47,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:16:47,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:16:47,543][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:16:47,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:16:47,566][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:16:47,567][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:16:48,229][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:16:48,689][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:16:49,219][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:16:49,721][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:16:50,223][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:16:50,727][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:16:51,231][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:16:51,736][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:16:52,240][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:16:52,746][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:16:53,254][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:16:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:16:54,260][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:16:54,763][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:16:55,266][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:16:55,769][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:16:56,273][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:16:56,774][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:16:57,275][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:16:57,776][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:16:58,279][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:16:58,781][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:16:59,286][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:16:59,794][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:17:00,300][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:17:00,810][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:17:01,320][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:17:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:17:02,340][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:17:02,848][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:17:03,356][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:17:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:17:04,368][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:17:04,875][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:17:05,379][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:17:05,884][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:17:06,402][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:17:06,905][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:17:07,407][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:17:07,913][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:17:08,415][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:17:08,923][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:17:09,425][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:17:09,928][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:17:10,433][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:17:10,937][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:17:11,441][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:17:11,945][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:17:12,449][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:17:12,956][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:17:13,460][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:17:13,964][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:17:14,468][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:17:14,971][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:17:15,478][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:17:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:17:16,484][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:17:16,997][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:17:17,498][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:17:18,003][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:17:18,514][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:17:19,018][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:17:19,525][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:17:20,030][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:17:20,541][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10857 tokens. [2025-11-13 01:17:21,280][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 01:17:22,060][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:17:22,061][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:17:22,063][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:17:22,979][__main__][INFO] - Iteration 214 took 50s (28.67% Gen, 69.53% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 14m 35s. Estimated total time: 42h 22m 16s. Time estimates for 10 more iterations: 8m 28s, 100 more iterations: 1h 24m 44s, 500 more iterations: 7h 3m 42s. [2025-11-13 01:17:22,982][__main__][INFO] - Starting iteration 214. [2025-11-13 01:17:23,480][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 01:17:23,481][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:17:29,239][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 30 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:17:31,104][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:17:39,513][__main__][INFO] - Number of regex retries in iteration 214: 2 [2025-11-13 01:17:39,514][__main__][INFO] - agents played in iteration 214 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:17:40,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:17:40,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:17:40,340][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:17:40,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:17:40,364][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:17:40,365][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:17:41,067][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:17:41,530][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:17:42,046][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:17:42,551][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:17:43,056][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:17:43,564][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:17:44,066][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:17:44,569][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:17:45,071][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:17:45,572][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:17:46,075][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:17:46,577][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:17:47,078][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:17:47,580][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:17:48,082][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:17:48,585][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:17:49,086][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:17:49,587][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:17:50,100][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:17:50,600][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:17:51,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:17:51,610][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:17:52,114][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:17:52,636][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:17:53,142][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:17:53,649][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:17:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:17:54,661][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:17:55,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:17:55,675][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:17:56,182][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:17:56,687][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:17:57,193][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:17:57,706][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:17:58,212][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:17:58,714][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:17:59,220][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:17:59,722][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:18:00,230][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:18:00,733][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:18:01,236][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:18:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:18:02,243][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:18:02,746][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:18:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:18:03,756][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:18:04,259][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:18:04,762][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:18:05,266][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:18:05,769][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:18:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:18:06,776][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:18:07,279][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:18:07,784][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:18:08,290][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:18:08,793][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:18:09,301][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:18:09,810][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:18:10,317][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:18:10,838][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:18:11,343][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:18:11,846][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:18:12,352][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:18:12,859][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:18:13,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10862 tokens. [2025-11-13 01:18:14,029][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:32 [2025-11-13 01:18:14,790][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:18:14,792][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:18:14,793][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:18:15,750][__main__][INFO] - Iteration 215 took 52s (30.67% Gen, 67.49% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 24m 57s. Estimated total time: 43h 33m 31s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 7s, 500 more iterations: 7h 15m 35s. [2025-11-13 01:18:15,753][__main__][INFO] - Starting iteration 215. [2025-11-13 01:18:16,239][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 01:18:16,240][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:18:21,067][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:18:24,100][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:18:32,517][__main__][INFO] - Number of regex retries in iteration 215: 2 [2025-11-13 01:18:32,517][__main__][INFO] - agents played in iteration 215 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:18:33,342][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:18:33,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:18:33,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:18:33,417][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:18:33,418][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:18:33,419][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:18:34,068][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:18:34,528][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:18:35,039][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:18:35,553][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:18:36,059][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:18:36,570][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:18:37,075][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:18:37,577][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:18:38,080][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:18:38,583][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:18:39,084][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:18:39,587][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:18:40,090][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:18:40,592][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:18:41,094][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:18:41,596][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:18:42,105][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:18:42,606][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:18:43,106][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:18:43,621][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:18:44,124][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:18:44,636][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:18:45,141][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:18:45,646][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:18:46,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:18:46,663][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:18:47,170][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:18:47,677][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:18:48,182][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:18:48,690][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:18:49,196][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:18:49,701][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:18:50,220][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:18:50,724][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:18:51,237][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:18:51,739][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:18:52,241][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:18:52,747][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:18:53,249][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:18:53,751][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:18:54,256][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:18:54,758][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:18:55,270][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:18:55,777][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:18:56,282][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:18:56,785][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:18:57,290][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:18:57,795][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:18:58,299][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:18:58,802][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:18:59,332][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:18:59,836][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:19:00,340][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:19:00,845][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:19:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:19:01,861][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:19:02,363][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:19:02,871][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:19:03,374][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:19:03,882][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:19:04,389][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:19:04,893][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:19:05,395][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:19:05,909][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:19:06,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10833 tokens. [2025-11-13 01:19:07,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.29%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.39%, ΔTime: 00:00:33 [2025-11-13 01:19:07,841][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:19:07,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:19:07,844][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:19:08,797][__main__][INFO] - Iteration 216 took 52s (30.97% Gen, 67.21% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 38m 28s. Estimated total time: 43h 47m 56s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 35s, 500 more iterations: 7h 17m 59s. [2025-11-13 01:19:08,800][__main__][INFO] - Starting iteration 216. [2025-11-13 01:19:09,290][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 01:19:09,291][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:19:14,061][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:19:26,049][__main__][INFO] - Number of regex retries in iteration 216: 1 [2025-11-13 01:19:26,049][__main__][INFO] - agents played in iteration 216 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:19:26,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:19:26,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:19:26,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:19:26,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:19:26,982][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:19:26,983][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:19:27,626][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:19:28,088][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:19:28,597][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:19:29,103][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:19:29,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:19:30,118][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:19:30,630][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:19:31,140][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:19:31,649][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:19:32,152][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:19:32,655][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:19:33,164][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:19:33,666][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:19:34,171][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:19:34,677][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:19:35,181][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:19:35,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:19:36,191][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:19:36,698][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:19:37,205][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:19:37,710][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:19:38,217][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:19:38,721][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:19:39,226][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:19:39,738][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:19:40,246][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:19:40,756][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:19:41,262][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:19:41,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:19:42,277][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:19:42,782][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:19:43,286][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:19:43,791][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:19:44,296][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:19:44,801][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:19:45,306][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:19:45,809][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:19:46,323][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:19:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:19:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:19:47,840][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:19:48,344][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:19:48,849][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:19:49,359][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:19:49,866][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:19:50,372][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:19:50,877][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:19:51,381][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:19:51,888][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:19:52,393][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:19:52,899][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:19:53,404][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:19:53,908][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:19:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:19:54,928][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:19:55,445][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:19:55,949][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:19:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:19:56,957][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:19:57,461][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:19:57,968][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:19:58,472][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:19:58,976][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:19:59,482][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:19:59,984][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10857 tokens. [2025-11-13 01:20:00,660][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 01:20:01,430][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:20:01,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:20:01,433][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:20:02,361][__main__][INFO] - Iteration 217 took 53s (31.57% Gen, 66.67% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 3m 13s. Estimated total time: 44h 13m 34s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 27s, 500 more iterations: 7h 22m 15s. [2025-11-13 01:20:02,363][__main__][INFO] - Starting iteration 217. [2025-11-13 01:20:02,840][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 01:20:02,841][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:20:17,769][__main__][INFO] - Number of regex retries in iteration 217: 0 [2025-11-13 01:20:17,770][__main__][INFO] - agents played in iteration 217 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:20:18,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:20:18,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:20:18,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:20:18,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:20:18,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:20:18,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:20:19,277][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:20:19,734][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:20:20,249][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:20:20,749][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:20:21,252][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:20:21,758][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:20:22,259][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:20:22,763][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:20:23,265][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:20:23,772][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:20:24,284][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:20:24,791][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:20:25,295][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:20:25,800][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:20:26,308][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:20:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:20:27,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:20:27,850][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:20:28,361][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:20:28,869][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:20:29,377][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:20:29,882][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:20:30,388][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:20:30,897][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:20:31,405][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:20:31,906][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:20:32,417][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:20:32,921][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:20:33,424][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:20:33,927][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:20:34,430][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:20:34,934][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:20:35,438][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:20:35,941][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:20:36,446][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:20:36,952][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:20:37,464][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:20:37,966][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:20:38,468][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:20:38,973][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:20:39,476][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:20:39,980][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:20:40,484][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:20:40,988][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:20:41,493][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:20:41,997][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:20:42,502][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:20:43,007][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:20:43,512][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:20:44,018][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:20:44,523][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:20:45,027][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:20:45,539][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:20:46,045][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:20:46,560][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:20:47,076][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:20:47,587][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:20:48,101][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:20:48,611][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:20:49,114][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:20:49,618][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:20:50,119][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:20:50,620][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:20:51,122][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:20:51,623][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10839 tokens. [2025-11-13 01:20:52,285][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:33 [2025-11-13 01:20:53,042][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:20:53,044][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:20:53,045][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:20:53,976][__main__][INFO] - Iteration 218 took 51s (29.19% Gen, 68.98% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 25m 38s. Estimated total time: 42h 36m 50s. Time estimates for 10 more iterations: 8m 31s, 100 more iterations: 1h 25m 13s, 500 more iterations: 7h 6m 8s. [2025-11-13 01:20:53,978][__main__][INFO] - Starting iteration 218. [2025-11-13 01:20:54,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 01:20:54,460][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:20:58,629][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:21:09,697][__main__][INFO] - Number of regex retries in iteration 218: 1 [2025-11-13 01:21:09,698][__main__][INFO] - agents played in iteration 218 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:21:10,570][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:21:10,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:21:10,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:21:10,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:21:10,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:21:10,643][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:21:11,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:21:11,754][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:21:12,264][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:21:12,769][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:21:13,277][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:21:13,782][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:21:14,286][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:21:14,791][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:21:15,297][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:21:15,807][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:21:16,313][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:21:16,819][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:21:17,328][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:21:17,836][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:21:18,346][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:21:18,862][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:21:19,369][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:21:19,878][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:21:20,382][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:21:20,885][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:21:21,395][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:21:21,898][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:21:22,402][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:21:22,905][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:21:23,411][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:21:23,918][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:21:24,424][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:21:24,930][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:21:25,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:21:25,945][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:21:26,457][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:21:26,961][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:21:27,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:21:27,972][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:21:28,477][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:21:28,983][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:21:29,487][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:21:29,991][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:21:30,500][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:21:31,003][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:21:31,507][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:21:32,011][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:21:32,512][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:21:33,015][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:21:33,518][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:21:34,022][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:21:34,531][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:21:35,041][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:21:35,544][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:21:36,049][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:21:36,555][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:21:37,084][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:21:37,588][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:21:38,091][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:21:38,597][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:21:39,102][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:21:39,610][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:21:40,115][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:21:40,618][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:21:41,119][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:21:41,621][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:21:42,122][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:21:42,623][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:21:43,125][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:21:43,627][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10867 tokens. [2025-11-13 01:21:44,298][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 01:21:45,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:21:45,078][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:21:45,080][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:21:46,001][__main__][INFO] - Iteration 219 took 51s (29.56% Gen, 68.65% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 45m 0s. Estimated total time: 42h 57m 4s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 54s, 500 more iterations: 7h 9m 30s. [2025-11-13 01:21:46,004][__main__][INFO] - Starting iteration 219. [2025-11-13 01:21:46,491][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 01:21:46,491][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:22:03,249][__main__][INFO] - Number of regex retries in iteration 219: 0 [2025-11-13 01:22:03,249][__main__][INFO] - agents played in iteration 219 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:22:04,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:22:04,122][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:22:04,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:22:04,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:22:04,172][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:22:04,172][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:22:04,819][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:22:05,276][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:22:05,781][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:22:06,285][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:22:06,789][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:22:07,305][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:22:07,812][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:22:08,316][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:22:08,823][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:22:09,328][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:22:09,835][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:22:10,341][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:22:10,844][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:22:11,350][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:22:11,856][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:22:12,364][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:22:12,880][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:22:13,385][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:22:13,904][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:22:14,411][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:22:14,917][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:22:15,423][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:22:15,930][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:22:16,437][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:22:16,942][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:22:17,447][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:22:17,961][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:22:18,465][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:22:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:22:19,479][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:22:19,983][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:22:20,504][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:22:21,007][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:22:21,538][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:22:22,043][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:22:22,548][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:22:23,057][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:22:23,563][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:22:24,068][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:22:24,588][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:22:25,094][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:22:25,603][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:22:26,108][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:22:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:22:27,118][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:22:27,622][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:22:28,125][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:22:28,629][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:22:29,132][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:22:29,636][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:22:30,138][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:22:30,640][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:22:31,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:22:31,646][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:22:32,148][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:22:32,651][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:22:33,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:22:33,661][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:22:34,165][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:22:34,667][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:22:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:22:35,681][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:22:36,198][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:22:36,700][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:22:37,201][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10828 tokens. [2025-11-13 01:22:37,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 01:22:38,631][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:22:38,633][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:22:38,634][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:22:39,539][__main__][INFO] - Iteration 220 took 53s (31.59% Gen, 66.70% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 59m 29s. Estimated total time: 44h 12m 27s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 24s, 500 more iterations: 7h 22m 4s. [2025-11-13 01:22:39,541][__main__][INFO] - Starting iteration 220. [2025-11-13 01:22:40,026][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 21 and human policies 1. 
[2025-11-13 01:22:40,026][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:22:45,014][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:22:54,725][__main__][INFO] - Number of regex retries in iteration 220: 1 [2025-11-13 01:22:54,726][__main__][INFO] - agents played in iteration 220 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:22:55,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:22:55,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:22:55,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:22:55,569][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:22:55,569][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:22:55,570][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:22:56,214][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:22:56,674][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:22:57,182][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:22:57,686][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:22:58,188][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:22:58,694][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:22:59,199][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:22:59,703][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:23:00,219][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:23:00,723][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:23:01,230][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:23:01,742][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:23:02,247][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:23:02,751][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:23:03,257][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:23:03,763][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:23:04,271][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:23:04,777][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:23:05,282][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:23:05,787][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:23:06,294][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:23:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:23:07,323][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:23:07,828][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:23:08,336][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:23:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:23:09,352][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:23:09,857][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:23:10,362][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:23:10,868][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:23:11,374][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:23:11,878][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:23:12,382][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:23:12,885][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:23:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:23:13,909][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:23:14,413][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:23:14,919][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:23:15,422][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:23:15,926][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:23:16,430][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:23:16,937][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:23:17,444][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:23:17,950][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:23:18,456][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:23:18,963][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:23:19,467][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:23:19,979][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:23:20,482][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:23:20,985][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:23:21,494][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:23:21,997][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:23:22,499][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:23:23,000][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:23:23,503][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:23:24,007][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:23:24,511][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:23:25,013][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:23:25,515][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:23:26,019][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:23:26,526][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:23:27,026][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:23:27,526][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:23:28,030][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:23:28,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10861 tokens. [2025-11-13 01:23:29,213][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 01:23:29,984][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:23:29,985][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:23:29,987][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:23:31,800][__main__][INFO] - Iteration 221 took 51s (28.39% Gen, 68.10% Train). Generation: 14s, Training: 35s. Estimated remaining time: 39h 54m 56s. Estimated total time: 43h 8m 46s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 17s, 500 more iterations: 7h 11m 27s. [2025-11-13 01:23:31,803][__main__][INFO] - Starting iteration 221. [2025-11-13 01:23:32,311][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. [2025-11-13 01:23:32,311][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:23:43,414][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given the per-item values, I aim to maximize my points. 
Since my values for hats and books are significantly higher compared to Alice’s, proposing to keep all hats maximizes my potential points from this round. Books, while valuable to me, are not as important as hats in this round, especially since Alice values them much more. Balls, with the lowest value for me, do not contribute meaningfully to my total points, so I propose to keep none. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:23:43,934][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:23:48,720][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:23:50,672][__main__][INFO] - Number of regex retries in iteration 221: 3 [2025-11-13 01:23:50,673][__main__][INFO] - agents played in iteration 221 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:23:51,484][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:23:51,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:23:51,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:23:51,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % 
(total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:23:51,562][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:23:51,563][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 01:23:52,247][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:23:52,708][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:23:53,217][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:23:53,721][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:23:54,230][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:23:54,737][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:23:55,243][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:23:55,745][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:23:56,251][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:23:56,762][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:23:57,269][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:23:57,777][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:23:58,286][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:23:58,798][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:23:59,306][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:23:59,823][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:24:00,329][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:24:00,838][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 
[2025-11-13 01:24:01,345][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:24:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:24:02,361][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 01:24:02,867][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:24:03,372][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:24:03,876][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:24:04,380][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:24:04,884][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:24:05,386][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:24:05,888][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:24:06,389][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:24:06,896][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:24:07,404][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:24:07,906][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:24:08,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:24:08,945][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:24:09,459][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:24:09,969][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:24:10,474][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:24:10,978][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:24:11,482][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 
[2025-11-13 01:24:11,987][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:24:12,491][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:24:12,995][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 01:24:13,501][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:24:14,019][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:24:14,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:24:15,028][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:24:15,538][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:24:16,041][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:24:16,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:24:17,051][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:24:17,554][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:24:18,059][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:24:18,562][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:24:19,065][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:24:19,572][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:24:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:24:20,582][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:24:21,084][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:24:21,586][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:24:22,089][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 
[2025-11-13 01:24:22,591][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:24:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:24:23,596][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 01:24:24,097][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:24:24,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10863 tokens. [2025-11-13 01:24:25,254][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.40%, ΔTime: 00:00:33 [2025-11-13 01:24:26,040][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:24:26,042][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:24:26,043][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:24:26,947][__main__][INFO] - Iteration 222 took 54s (33.61% Gen, 64.74% Train). Generation: 18s, Training: 35s. Estimated remaining time: 42h 17m 3s. Estimated total time: 45h 31m 48s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 3s, 500 more iterations: 7h 35m 18s. [2025-11-13 01:24:26,949][__main__][INFO] - Starting iteration 222. [2025-11-13 01:24:27,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 01:24:27,437][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:24:43,876][__main__][INFO] - Number of regex retries in iteration 222: 0 [2025-11-13 01:24:43,877][__main__][INFO] - agents played in iteration 222 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:24:44,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:24:44,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:24:44,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:24:44,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:24:44,805][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:24:44,806][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:24:45,500][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:24:45,962][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:24:46,472][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:24:46,981][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:24:47,486][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:24:47,994][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:24:48,500][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:24:49,007][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:24:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:24:50,023][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:24:50,528][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:24:51,033][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:24:51,540][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:24:52,053][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:24:52,559][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:24:53,063][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:24:53,571][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:24:54,075][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:24:54,579][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:24:55,082][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:24:55,587][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:24:56,087][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:24:56,591][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:24:57,093][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:24:57,597][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:24:58,101][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:24:58,605][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:24:59,109][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:24:59,615][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:25:00,132][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:25:00,636][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:25:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:25:01,650][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:25:02,152][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:25:02,663][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:25:03,166][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:25:03,668][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:25:04,170][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:25:04,673][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:25:05,180][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:25:05,684][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:25:06,186][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:25:06,687][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:25:07,189][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:25:07,691][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:25:08,192][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:25:08,696][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:25:09,199][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:25:09,700][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:25:10,202][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:25:10,704][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:25:11,211][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:25:11,729][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:25:12,230][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:25:12,733][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:25:13,238][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:25:13,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:25:14,252][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:25:14,756][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:25:15,259][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:25:15,764][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:25:16,267][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:25:16,770][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:25:17,275][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:25:17,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10854 tokens. [2025-11-13 01:25:18,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.52%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:32 [2025-11-13 01:25:19,220][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:25:19,222][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:25:19,224][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:25:20,198][__main__][INFO] - Iteration 223 took 52s (31.16% Gen, 66.99% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 42m 27s. Estimated total time: 43h 58m 6s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 56s, 500 more iterations: 7h 19m 41s. [2025-11-13 01:25:20,200][__main__][INFO] - Starting iteration 223. [2025-11-13 01:25:20,703][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 01:25:20,703][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:25:25,741][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:25:36,569][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:25:37,336][__main__][INFO] - Number of regex retries in iteration 223: 2 [2025-11-13 01:25:37,336][__main__][INFO] - agents played in iteration 223 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:25:38,155][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:25:38,180][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:25:38,203][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:25:38,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:25:38,226][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:25:38,227][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:25:38,937][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:25:39,402][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:25:39,912][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:25:40,437][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:25:40,944][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:25:41,447][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:25:41,955][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:25:42,465][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:25:42,972][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:25:43,479][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:25:43,985][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:25:44,491][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:25:44,998][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:25:45,501][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:25:46,011][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:25:46,515][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:25:47,029][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:25:47,534][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:25:48,042][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:25:48,550][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:25:49,055][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:25:49,561][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:25:50,070][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:25:50,577][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:25:51,086][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:25:51,590][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:25:52,092][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:25:52,593][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:25:53,096][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:25:53,601][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:25:54,101][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:25:54,603][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:25:55,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:25:55,605][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:25:56,106][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:25:56,607][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:25:57,112][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:25:57,619][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:25:58,122][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:25:58,623][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:25:59,124][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:25:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:26:00,133][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:26:00,636][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:26:01,138][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:26:01,646][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:26:02,146][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:26:02,656][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:26:03,159][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:26:03,661][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:26:04,170][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:26:04,671][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:26:05,175][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:26:05,677][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:26:06,179][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:26:06,684][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:26:07,186][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:26:07,686][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:26:08,186][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:26:08,687][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:26:09,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:26:09,689][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:26:10,189][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:26:10,689][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:26:11,190][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10864 tokens. [2025-11-13 01:26:11,857][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:32 [2025-11-13 01:26:12,632][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:26:12,633][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:26:12,635][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:26:13,544][__main__][INFO] - Iteration 224 took 52s (31.48% Gen, 66.80% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 45m 34s. Estimated total time: 44h 2m 5s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 4s, 500 more iterations: 7h 20m 20s. [2025-11-13 01:26:13,546][__main__][INFO] - Starting iteration 224. [2025-11-13 01:26:14,047][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 01:26:14,048][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:26:24,911][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 11 books, 9 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:26:29,513][__main__][INFO] - Number of regex retries in iteration 224: 1 [2025-11-13 01:26:29,514][__main__][INFO] - agents played in iteration 224 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:26:30,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:26:30,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:26:30,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:26:30,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:26:30,443][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:26:30,444][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:26:31,128][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:26:31,600][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:26:32,139][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:26:32,649][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:26:33,161][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:26:33,672][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:26:34,185][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:26:34,693][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:26:35,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:26:35,709][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:26:36,213][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:26:36,721][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:26:37,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:26:37,729][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:26:38,233][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:26:38,737][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:26:39,247][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:26:39,751][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:26:40,257][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:26:40,764][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:26:41,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:26:41,775][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:26:42,280][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:26:42,782][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:26:43,291][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:26:43,793][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:26:44,297][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:26:44,799][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:26:45,302][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:26:45,810][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:26:46,313][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:26:46,820][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:26:47,329][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:26:47,834][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:26:48,343][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:26:48,849][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:26:49,349][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:26:49,853][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:26:50,356][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:26:50,858][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:26:51,360][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:26:51,860][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:26:52,369][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:26:52,869][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:26:53,370][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:26:53,879][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:26:54,382][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:26:54,882][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:26:55,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:26:55,886][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:26:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:26:56,895][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:26:57,396][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:26:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:26:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:26:58,902][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:26:59,403][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:26:59,904][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:27:00,406][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:27:00,907][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:27:01,404][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:27:01,907][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:27:02,408][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:27:02,909][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:27:03,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10800 tokens. [2025-11-13 01:27:04,084][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.56%, ΔTime: 00:00:32 [2025-11-13 01:27:04,850][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:27:04,852][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:27:04,853][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:27:05,787][__main__][INFO] - Iteration 225 took 51s (29.89% Gen, 68.30% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 49m 36s. Estimated total time: 43h 7m 0s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 14s, 500 more iterations: 7h 11m 10s. [2025-11-13 01:27:05,789][__main__][INFO] - Starting iteration 225. [2025-11-13 01:27:06,258][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 01:27:06,258][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:27:11,662][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:27:12,523][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 30 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 2/3 [2025-11-13 01:27:22,224][__main__][INFO] - Number of regex retries in iteration 225: 2 [2025-11-13 01:27:22,225][__main__][INFO] - agents played in iteration 225 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:27:23,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:27:23,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:27:23,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:27:23,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:27:23,168][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:27:23,169][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:27:23,870][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:27:24,338][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:27:24,846][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:27:25,357][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:27:25,861][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:27:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:27:26,872][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:27:27,377][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:27:27,891][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:27:28,403][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:27:28,909][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:27:29,419][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:27:29,926][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:27:30,438][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:27:30,952][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:27:31,460][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:27:31,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:27:32,485][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:27:32,994][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:27:33,499][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:27:34,004][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:27:34,511][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:27:35,018][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:27:35,523][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:27:36,027][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:27:36,534][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:27:37,040][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:27:37,548][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:27:38,066][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:27:38,576][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:27:39,085][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:27:39,595][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:27:40,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:27:40,613][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:27:41,118][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:27:41,622][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:27:42,123][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:27:42,625][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:27:43,127][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:27:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:27:44,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:27:44,653][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:27:45,173][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:27:45,676][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:27:46,180][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:27:46,686][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:27:47,192][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:27:47,695][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:27:48,200][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:27:48,702][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:27:49,205][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:27:49,706][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:27:50,210][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:27:50,712][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:27:51,213][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:27:51,714][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:27:52,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:27:52,721][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:27:53,230][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:27:53,733][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:27:54,237][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:27:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:27:55,241][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:27:55,758][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:27:56,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10837 tokens. [2025-11-13 01:27:56,939][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 01:27:57,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:27:57,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:27:57,705][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:27:58,683][__main__][INFO] - Iteration 226 took 52s (30.46% Gen, 67.68% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 23m 0s. Estimated total time: 43h 41m 18s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 53s. [2025-11-13 01:27:58,686][__main__][INFO] - Starting iteration 226. [2025-11-13 01:27:59,156][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 01:27:59,157][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:28:14,461][__main__][INFO] - Number of regex retries in iteration 226: 0 [2025-11-13 01:28:14,462][__main__][INFO] - agents played in iteration 226 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:28:15,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:28:15,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:28:15,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:28:15,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:28:15,405][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:28:15,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:28:16,115][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:28:16,582][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:28:17,092][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:28:17,598][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:28:18,111][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:28:18,614][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:28:19,122][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:28:19,626][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:28:20,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:28:20,641][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:28:21,149][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:28:21,653][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:28:22,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:28:22,685][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:28:23,190][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:28:23,697][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:28:24,204][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:28:24,711][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:28:25,217][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:28:25,722][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:28:26,226][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:28:26,731][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:28:27,235][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:28:27,742][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:28:28,244][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:28:28,765][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:28:29,269][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:28:29,774][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:28:30,280][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:28:30,785][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:28:31,296][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:28:31,803][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:28:32,308][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:28:32,816][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:28:33,326][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:28:33,832][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:28:34,340][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:28:34,848][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:28:35,354][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:28:35,857][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:28:36,361][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:28:36,863][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:28:37,366][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:28:37,884][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:28:38,384][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:28:38,888][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:28:39,390][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:28:39,892][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:28:40,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:28:40,902][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:28:41,406][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:28:41,912][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:28:42,417][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:28:42,922][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:28:43,426][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:28:43,930][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:28:44,435][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:28:44,939][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:28:45,442][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:28:45,946][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:28:46,450][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:28:46,956][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:28:47,469][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:28:47,973][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:28:48,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10874 tokens. [2025-11-13 01:28:49,172][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 01:28:49,960][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:28:49,962][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:28:49,964][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:28:50,880][__main__][INFO] - Iteration 227 took 51s (29.59% Gen, 68.64% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 47m 3s. Estimated total time: 43h 6m 13s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 12s, 500 more iterations: 7h 11m 2s. [2025-11-13 01:28:50,882][__main__][INFO] - Starting iteration 227. [2025-11-13 01:28:51,350][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 01:28:51,350][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:28:55,332][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:29:05,825][__main__][INFO] - Number of regex retries in iteration 227: 1 [2025-11-13 01:29:05,825][__main__][INFO] - agents played in iteration 227 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:29:06,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:29:06,643][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:29:06,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:29:06,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:29:06,701][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:29:06,702][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:29:07,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:29:07,881][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:29:08,394][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:29:08,904][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:29:09,410][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:29:09,918][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:29:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:29:10,938][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:29:11,446][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:29:11,952][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:29:12,467][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:29:12,973][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:29:13,488][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:29:13,996][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:29:14,502][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:29:15,009][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:29:15,515][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:29:16,020][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:29:16,523][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:29:17,027][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:29:17,534][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:29:18,039][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:29:18,544][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:29:19,047][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:29:19,549][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:29:20,060][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:29:20,564][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:29:21,068][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:29:21,573][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:29:22,077][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:29:22,601][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:29:23,105][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:29:23,610][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:29:24,112][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:29:24,616][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:29:25,122][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:29:25,626][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:29:26,130][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:29:26,637][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:29:27,143][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:29:27,647][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:29:28,149][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:29:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:29:29,159][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:29:29,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:29:30,162][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:29:30,678][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:29:31,179][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:29:31,685][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:29:32,185][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:29:32,685][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:29:33,189][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:29:33,690][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:29:34,190][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:29:34,689][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:29:35,190][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:29:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:29:36,207][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:29:36,708][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:29:37,210][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:29:37,711][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:29:38,215][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:29:38,714][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:29:39,215][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:29:39,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10822 tokens. [2025-11-13 01:29:40,372][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:32 [2025-11-13 01:29:41,163][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:29:41,165][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:29:41,166][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:29:42,081][__main__][INFO] - Iteration 228 took 50s (28.53% Gen, 69.66% Train). Generation: 14s, Training: 35s. Estimated remaining time: 38h 56m 34s. Estimated total time: 42h 16m 35s. Time estimates for 10 more iterations: 8m 27s, 100 more iterations: 1h 24m 33s, 500 more iterations: 7h 2m 45s. [2025-11-13 01:29:42,083][__main__][INFO] - Starting iteration 228. [2025-11-13 01:29:42,565][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 01:29:42,566][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:29:57,628][__main__][INFO] - Number of regex retries in iteration 228: 0 [2025-11-13 01:29:57,629][__main__][INFO] - agents played in iteration 228 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:29:58,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:29:58,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:29:58,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:29:58,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:29:58,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:29:58,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:29:59,202][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:29:59,662][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:30:00,172][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:30:00,677][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:30:01,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:30:01,704][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:30:02,209][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:30:02,715][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:30:03,225][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:30:03,731][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:30:04,245][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:30:04,751][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:30:05,256][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:30:05,765][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:30:06,272][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:30:06,778][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:30:07,283][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:30:07,788][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:30:08,294][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:30:08,800][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:30:09,306][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:30:09,810][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:30:10,312][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:30:10,826][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:30:11,329][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:30:11,830][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:30:12,334][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:30:12,837][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:30:13,339][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:30:13,843][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:30:14,350][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:30:14,857][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:30:15,363][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:30:15,866][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:30:16,374][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:30:16,878][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:30:17,401][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:30:17,905][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:30:18,408][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:30:18,915][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:30:19,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:30:19,926][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:30:20,431][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:30:20,936][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:30:21,441][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:30:21,948][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:30:22,457][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:30:22,963][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:30:23,470][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:30:23,973][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:30:24,476][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:30:24,979][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:30:25,492][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:30:25,995][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:30:26,497][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:30:27,000][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:30:27,502][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:30:28,006][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:30:28,507][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:30:29,010][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:30:29,513][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:30:30,015][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:30:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:30:31,021][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:30:31,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10861 tokens. [2025-11-13 01:30:32,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.10%, ΔTime: 00:00:33 [2025-11-13 01:30:32,991][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:30:32,992][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:30:32,994][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:30:33,916][__main__][INFO] - Iteration 229 took 51s (29.33% Gen, 68.87% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 26m 42s. Estimated total time: 42h 47m 34s. Time estimates for 10 more iterations: 8m 33s, 100 more iterations: 1h 25m 35s, 500 more iterations: 7h 7m 55s. [2025-11-13 01:30:33,918][__main__][INFO] - Starting iteration 229. [2025-11-13 01:30:34,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 01:30:34,406][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:30:50,915][__main__][INFO] - Number of regex retries in iteration 229: 0 [2025-11-13 01:30:50,916][__main__][INFO] - agents played in iteration 229 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:30:51,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:30:51,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:30:51,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:30:51,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:30:51,823][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:30:51,824][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:30:52,519][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:30:52,980][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:30:53,488][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:30:53,991][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:30:54,496][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:30:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:30:55,505][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:30:56,023][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:30:56,530][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:30:57,039][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:30:57,546][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:30:58,053][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:30:58,563][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:30:59,069][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:30:59,575][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:31:00,091][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:31:00,598][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:31:01,103][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:31:01,607][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:31:02,109][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:31:02,630][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:31:03,132][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:31:03,637][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:31:04,138][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:31:04,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:31:05,152][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:31:05,657][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:31:06,161][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:31:06,666][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:31:07,171][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:31:07,675][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:31:08,180][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:31:08,683][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:31:09,196][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:31:09,702][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:31:10,206][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:31:10,712][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:31:11,218][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:31:11,730][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:31:12,236][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:31:12,741][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:31:13,248][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:31:13,751][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:31:14,255][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:31:14,757][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:31:15,260][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:31:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:31:16,269][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:31:16,773][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:31:17,273][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:31:17,773][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:31:18,276][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:31:18,776][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:31:19,277][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:31:19,778][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:31:20,279][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:31:20,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:31:21,287][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:31:21,788][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:31:22,295][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:31:22,797][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:31:23,300][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:31:23,807][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:31:24,314][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:31:24,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10861 tokens. [2025-11-13 01:31:25,492][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:32 [2025-11-13 01:31:26,257][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:31:26,258][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:31:26,260][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:31:27,257][__main__][INFO] - Iteration 230 took 52s (31.24% Gen, 66.88% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 40m 48s. Estimated total time: 44h 2m 34s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 5s, 500 more iterations: 7h 20m 25s. [2025-11-13 01:31:27,259][__main__][INFO] - Starting iteration 230. [2025-11-13 01:31:27,742][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 22 and human policies 1. 
[2025-11-13 01:31:27,743][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:31:44,125][__main__][INFO] - Number of regex retries in iteration 230: 0 [2025-11-13 01:31:44,126][__main__][INFO] - agents played in iteration 230 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:31:44,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:31:45,000][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:31:45,024][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:31:45,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:31:45,047][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:31:45,048][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:31:45,733][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:31:46,201][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:31:46,707][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:31:47,213][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:31:47,721][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:31:48,225][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:31:48,736][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:31:49,242][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:31:49,747][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:31:50,260][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:31:50,771][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:31:51,277][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:31:51,784][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:31:52,293][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:31:52,806][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:31:53,312][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:31:53,819][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:31:54,326][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:31:54,831][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:31:55,340][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:31:55,844][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:31:56,347][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:31:56,850][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:31:57,358][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:31:57,863][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:31:58,368][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:31:58,874][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:31:59,381][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:31:59,885][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:32:00,391][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:32:00,902][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:32:01,406][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:32:01,923][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:32:02,426][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:32:02,931][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:32:03,438][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:32:03,947][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:32:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:32:04,959][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:32:05,473][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:32:05,981][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:32:06,486][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:32:06,988][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:32:07,494][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:32:07,999][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:32:08,506][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:32:09,020][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:32:09,522][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:32:10,043][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:32:10,549][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:32:11,076][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:32:11,580][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:32:12,083][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:32:12,588][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:32:13,091][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:32:13,593][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:32:14,100][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:32:14,599][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:32:15,101][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:32:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:32:16,119][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:32:16,635][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:32:17,136][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:32:17,639][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:32:18,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10861 tokens. [2025-11-13 01:32:18,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 01:32:19,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:32:19,604][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:32:19,606][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:32:21,392][__main__][INFO] - Iteration 231 took 53s (30.54% Gen, 66.13% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 19m 50s. Estimated total time: 44h 42m 30s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 25s, 500 more iterations: 7h 27m 5s. [2025-11-13 01:32:21,394][__main__][INFO] - Starting iteration 231. [2025-11-13 01:32:21,867][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 01:32:21,868][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:32:27,117][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:32:37,246][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:32:39,785][__main__][INFO] - Number of regex retries in iteration 231: 2 [2025-11-13 01:32:39,786][__main__][INFO] - agents played in iteration 231 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:32:40,633][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:32:40,655][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:32:40,677][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:32:40,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:32:40,700][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:32:40,701][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:32:41,410][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:32:41,873][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:32:42,384][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:32:42,892][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:32:43,394][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:32:43,897][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:32:44,399][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:32:44,903][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:32:45,407][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:32:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:32:46,420][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:32:46,928][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:32:47,433][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:32:47,942][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:32:48,448][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:32:48,954][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:32:49,462][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:32:49,970][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:32:50,475][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:32:50,982][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:32:51,487][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:32:51,992][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:32:52,497][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:32:53,002][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:32:53,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:32:54,031][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:32:54,540][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:32:55,045][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:32:55,550][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:32:56,059][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:32:56,564][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:32:57,072][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:32:57,583][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:32:58,088][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:32:58,590][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:32:59,096][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:32:59,600][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:33:00,106][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:33:00,612][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:33:01,117][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:33:01,620][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:33:02,127][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:33:02,645][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:33:03,151][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:33:03,654][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:33:04,157][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:33:04,659][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:33:05,164][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:33:05,664][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:33:06,167][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:33:06,675][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:33:07,177][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:33:07,681][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:33:08,184][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:33:08,684][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:33:09,186][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:33:09,689][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:33:10,189][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:33:10,690][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:33:11,191][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:33:11,693][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:33:12,195][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:33:12,698][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:33:13,201][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:33:13,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10857 tokens. [2025-11-13 01:33:14,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.39%, ΔTime: 00:00:32 [2025-11-13 01:33:15,141][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:33:15,142][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:33:15,144][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:33:16,125][__main__][INFO] - Iteration 232 took 54s (33.02% Gen, 65.16% Train). Generation: 17s, Training: 35s. Estimated remaining time: 41h 49m 21s. Estimated total time: 45h 12m 55s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 25s, 500 more iterations: 7h 32m 9s. [2025-11-13 01:33:16,128][__main__][INFO] - Starting iteration 232. [2025-11-13 01:33:16,607][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 01:33:16,608][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:33:32,767][__main__][INFO] - Number of regex retries in iteration 232: 0 [2025-11-13 01:33:32,767][__main__][INFO] - agents played in iteration 232 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:33:33,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:33:33,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:33:33,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:33:33,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:33:33,672][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:33:33,672][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:33:34,392][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:33:34,857][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:33:35,369][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:33:35,877][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:33:36,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:33:36,896][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:33:37,404][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:33:37,910][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:33:38,416][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:33:38,925][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:33:39,432][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:33:39,936][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:33:40,441][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:33:40,945][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:33:41,451][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:33:41,967][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:33:42,472][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:33:42,974][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:33:43,478][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:33:43,980][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:33:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:33:44,989][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:33:45,491][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:33:45,995][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:33:46,498][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:33:47,000][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:33:47,502][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:33:48,007][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:33:48,513][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:33:49,015][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:33:49,517][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:33:50,038][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:33:50,544][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:33:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:33:51,553][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:33:52,057][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:33:52,571][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:33:53,075][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:33:53,578][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:33:54,082][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:33:54,585][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:33:55,088][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:33:55,590][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:33:56,092][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:33:56,598][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:33:57,103][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:33:57,608][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:33:58,111][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:33:58,611][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:33:59,117][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:33:59,621][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:34:00,122][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:34:00,638][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:34:01,140][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:34:01,642][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:34:02,143][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:34:02,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:34:03,160][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:34:03,660][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:34:04,160][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:34:04,660][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:34:05,161][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:34:05,669][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:34:06,169][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:34:06,670][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10859 tokens. [2025-11-13 01:34:07,337][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:32 [2025-11-13 01:34:08,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:34:08,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:34:08,136][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:34:09,033][__main__][INFO] - Iteration 233 took 52s (30.82% Gen, 67.46% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 16m 50s. Estimated total time: 43h 41m 17s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 22s, 500 more iterations: 7h 16m 52s. [2025-11-13 01:34:09,035][__main__][INFO] - Starting iteration 233. [2025-11-13 01:34:09,527][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 01:34:09,528][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:34:14,109][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:34:25,403][__main__][INFO] - Number of regex retries in iteration 233: 1 [2025-11-13 01:34:25,403][__main__][INFO] - agents played in iteration 233 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:34:26,244][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:34:26,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:34:26,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:34:26,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:34:26,312][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:34:26,313][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:34:27,013][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:34:27,476][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:34:27,991][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:34:28,496][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:34:29,007][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:34:29,512][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:34:30,019][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:34:30,531][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:34:31,041][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:34:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:34:32,062][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:34:32,573][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:34:33,088][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:34:33,590][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:34:34,096][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:34:34,603][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:34:35,106][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:34:35,616][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:34:36,121][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:34:36,626][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:34:37,138][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:34:37,643][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:34:38,154][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:34:38,658][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:34:39,163][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:34:39,668][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:34:40,172][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:34:40,678][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:34:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:34:41,685][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:34:42,191][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:34:42,696][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:34:43,198][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:34:43,704][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:34:44,211][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:34:44,719][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:34:45,226][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:34:45,730][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:34:46,233][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:34:46,737][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:34:47,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:34:47,743][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:34:48,246][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:34:48,762][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:34:49,266][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:34:49,770][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:34:50,272][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:34:50,773][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:34:51,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:34:51,791][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:34:52,299][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:34:52,809][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:34:53,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:34:53,819][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:34:54,322][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:34:54,825][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:34:55,328][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:34:55,884][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:34:56,389][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:34:56,890][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:34:57,396][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:34:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:34:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:34:58,904][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:34:59,406][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10849 tokens. [2025-11-13 01:35:00,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 01:35:00,858][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:35:00,860][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:35:00,861][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:35:01,769][__main__][INFO] - Iteration 234 took 52s (30.39% Gen, 67.87% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 6m 48s. Estimated total time: 43h 32m 8s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 4s, 500 more iterations: 7h 15m 21s. [2025-11-13 01:35:01,772][__main__][INFO] - Starting iteration 234. [2025-11-13 01:35:02,240][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 01:35:02,240][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:35:07,030][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:35:18,024][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:35:18,998][__main__][INFO] - Number of regex retries in iteration 234: 2 [2025-11-13 01:35:18,998][__main__][INFO] - agents played in iteration 234 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:35:19,840][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:35:19,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:35:19,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:35:19,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:35:19,912][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:35:19,913][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:35:20,626][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:35:21,088][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:35:21,603][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:35:22,107][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:35:22,612][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:35:23,117][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:35:23,623][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:35:24,131][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:35:24,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:35:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:35:25,658][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:35:26,164][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:35:26,683][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:35:27,189][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:35:27,692][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:35:28,202][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:35:28,707][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:35:29,214][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:35:29,718][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:35:30,224][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:35:30,730][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:35:31,235][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:35:31,738][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:35:32,250][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:35:32,759][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:35:33,267][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:35:33,771][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:35:34,276][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:35:34,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:35:35,290][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:35:35,794][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:35:36,301][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:35:36,806][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:35:37,313][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:35:37,818][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:35:38,322][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:35:38,824][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:35:39,327][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:35:39,830][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:35:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:35:40,838][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:35:41,341][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:35:41,846][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:35:42,348][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:35:42,862][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:35:43,363][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:35:43,879][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:35:44,382][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:35:44,885][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:35:45,401][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:35:45,903][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:35:46,405][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:35:46,911][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:35:47,415][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:35:47,920][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:35:48,420][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:35:48,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:35:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:35:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:35:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:35:50,941][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:35:51,443][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:35:51,946][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:35:52,447][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:35:52,949][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10868 tokens. [2025-11-13 01:35:53,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 01:35:54,414][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:35:54,415][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:35:54,417][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:35:55,335][__main__][INFO] - Iteration 235 took 53s (31.56% Gen, 66.71% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 48m 35s. Estimated total time: 44h 14m 48s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 29s, 500 more iterations: 7h 22m 28s. [2025-11-13 01:35:55,337][__main__][INFO] - Starting iteration 235. [2025-11-13 01:35:55,944][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 01:35:55,945][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:36:11,222][__main__][INFO] - Number of regex retries in iteration 235: 0 [2025-11-13 01:36:11,223][__main__][INFO] - agents played in iteration 235 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:36:12,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:36:12,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:36:12,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:36:12,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:36:12,142][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:36:12,142][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:36:12,875][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:36:13,337][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:36:13,847][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:36:14,362][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:36:14,868][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:36:15,388][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:36:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:36:16,397][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:36:16,906][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:36:17,412][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:36:17,915][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:36:18,420][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:36:18,924][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:36:19,432][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:36:19,941][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:36:20,446][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:36:20,966][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:36:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:36:21,978][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:36:22,483][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:36:22,989][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:36:23,504][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:36:24,010][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:36:24,514][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:36:25,020][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:36:25,526][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:36:26,033][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:36:26,537][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:36:27,041][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:36:27,546][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:36:28,050][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:36:28,557][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:36:29,078][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:36:29,584][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:36:30,090][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:36:30,606][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:36:31,112][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:36:31,620][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:36:32,125][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:36:32,628][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:36:33,132][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:36:33,638][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:36:34,142][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:36:34,648][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:36:35,154][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:36:35,660][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:36:36,164][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:36:36,668][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:36:37,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:36:37,692][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:36:38,203][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:36:38,706][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:36:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:36:39,718][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:36:40,223][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:36:40,728][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:36:41,236][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:36:41,742][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:36:42,246][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:36:42,748][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:36:43,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:36:43,758][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:36:44,260][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:36:44,763][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:36:45,278][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10846 tokens. [2025-11-13 01:36:45,942][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 01:36:46,733][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:36:46,735][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:36:46,737][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:36:47,666][__main__][INFO] - Iteration 236 took 51s (29.54% Gen, 68.66% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 39m 1s. Estimated total time: 43h 6m 7s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 12s, 500 more iterations: 7h 11m 1s. [2025-11-13 01:36:47,668][__main__][INFO] - Starting iteration 236. [2025-11-13 01:36:48,158][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 01:36:48,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:37:03,933][__main__][INFO] - Number of regex retries in iteration 236: 0 [2025-11-13 01:37:03,934][__main__][INFO] - agents played in iteration 236 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:37:04,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:37:04,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:37:04,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:37:04,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:37:04,845][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:37:04,846][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:37:05,570][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:37:06,039][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:37:06,549][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:37:07,055][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:37:07,558][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:37:08,063][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:37:08,568][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:37:09,073][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:37:09,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:37:10,097][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:37:10,602][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:37:11,117][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:37:11,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:37:12,125][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:37:12,631][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:37:13,136][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:37:13,641][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:37:14,150][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:37:14,657][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:37:15,167][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:37:15,677][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:37:16,183][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:37:16,687][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:37:17,193][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:37:17,714][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:37:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:37:18,726][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:37:19,237][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:37:19,744][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:37:20,252][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:37:20,759][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:37:21,265][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:37:21,783][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:37:22,286][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:37:22,792][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:37:23,304][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:37:23,811][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:37:24,326][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:37:24,831][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:37:25,335][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:37:25,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:37:26,345][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:37:26,851][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:37:27,356][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:37:27,862][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:37:28,368][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:37:28,872][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:37:29,378][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:37:29,883][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:37:30,388][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:37:30,908][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:37:31,410][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:37:31,915][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:37:32,420][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:37:32,923][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:37:33,433][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:37:33,937][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:37:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:37:34,944][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:37:35,449][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:37:35,956][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:37:36,460][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:37:36,962][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:37:37,466][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:37:37,968][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10853 tokens. [2025-11-13 01:37:38,681][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 01:37:39,463][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:37:39,464][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:37:39,467][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:37:40,391][__main__][INFO] - Iteration 237 took 52s (30.20% Gen, 68.03% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 3m 43s. Estimated total time: 43h 31m 42s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 3s, 500 more iterations: 7h 15m 17s. [2025-11-13 01:37:40,393][__main__][INFO] - Starting iteration 237. [2025-11-13 01:37:40,881][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 01:37:40,882][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:37:58,527][__main__][INFO] - Number of regex retries in iteration 237: 0 [2025-11-13 01:37:58,528][__main__][INFO] - agents played in iteration 237 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:37:59,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:37:59,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:37:59,401][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:37:59,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:37:59,423][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:37:59,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:38:00,110][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:38:00,573][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:38:01,084][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:38:01,590][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:38:02,099][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:38:02,603][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:38:03,107][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:38:03,627][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:38:04,132][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:38:04,635][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:38:05,137][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:38:05,641][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:38:06,148][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:38:06,652][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:38:07,152][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:38:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:38:08,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:38:08,668][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:38:09,173][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:38:09,677][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:38:10,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:38:10,697][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:38:11,201][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:38:11,707][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:38:12,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:38:12,735][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:38:13,241][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:38:13,749][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:38:14,258][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:38:14,766][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:38:15,281][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:38:15,790][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:38:16,298][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:38:16,803][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:38:17,306][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:38:17,811][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:38:18,314][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:38:18,818][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:38:19,335][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:38:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:38:20,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:38:20,848][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:38:21,353][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:38:21,862][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:38:22,367][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:38:22,872][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:38:23,378][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:38:23,882][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:38:24,386][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:38:24,890][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:38:25,394][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:38:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:38:26,417][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:38:26,922][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:38:27,425][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:38:27,928][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:38:28,436][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:38:28,940][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:38:29,447][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:38:29,955][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:38:30,462][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:38:30,967][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:38:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:38:31,977][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:38:32,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10838 tokens. [2025-11-13 01:38:33,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 01:38:33,973][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:38:33,975][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:38:33,976][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:38:34,833][__main__][INFO] - Iteration 238 took 53s (32.71% Gen, 65.70% Train). Generation: 17s, Training: 35s. Estimated remaining time: 41h 28m 46s. Estimated total time: 44h 57m 40s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 55s, 500 more iterations: 7h 29m 36s. [2025-11-13 01:38:34,839][__main__][INFO] - Starting iteration 238. [2025-11-13 01:38:35,333][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 01:38:35,334][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:38:51,442][__main__][INFO] - Number of regex retries in iteration 238: 0 [2025-11-13 01:38:51,443][__main__][INFO] - agents played in iteration 238 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:38:52,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:38:52,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:38:52,337][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:38:52,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:38:52,360][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:38:52,361][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:38:53,060][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:38:53,521][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:38:54,029][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:38:54,537][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:38:55,042][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:38:55,550][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:38:56,056][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:38:56,562][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:38:57,067][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:38:57,575][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:38:58,079][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:38:58,587][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:38:59,090][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:38:59,594][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:39:00,109][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:39:00,613][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:39:01,130][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:39:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:39:02,149][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:39:02,654][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:39:03,167][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:39:03,681][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:39:04,186][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:39:04,696][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:39:05,219][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:39:05,722][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:39:06,227][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:39:06,732][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:39:07,236][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:39:07,745][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:39:08,247][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:39:08,754][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:39:09,259][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:39:09,766][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:39:10,275][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:39:10,778][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:39:11,285][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:39:11,793][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:39:12,300][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:39:12,805][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:39:13,315][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:39:13,818][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:39:14,337][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:39:14,840][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:39:15,345][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:39:15,845][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:39:16,347][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:39:16,853][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:39:17,357][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:39:17,861][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:39:18,366][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:39:18,870][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:39:19,375][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:39:19,878][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:39:20,382][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:39:20,889][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:39:21,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:39:21,895][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:39:22,397][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:39:22,900][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:39:23,403][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:39:23,909][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:39:24,415][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:39:24,921][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:39:25,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10851 tokens. [2025-11-13 01:39:26,128][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 01:39:26,873][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:39:26,874][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:39:26,877][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:39:27,741][__main__][INFO] - Iteration 239 took 52s (30.74% Gen, 67.61% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 10m 39s. Estimated total time: 43h 40m 26s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 20s, 500 more iterations: 7h 16m 44s. [2025-11-13 01:39:27,743][__main__][INFO] - Starting iteration 239. [2025-11-13 01:39:28,212][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 01:39:28,213][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:39:43,535][__main__][INFO] - Number of regex retries in iteration 239: 0 [2025-11-13 01:39:43,536][__main__][INFO] - agents played in iteration 239 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:39:44,389][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:39:44,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:39:44,442][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:39:44,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:39:44,466][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:39:44,467][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:39:45,144][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:39:45,608][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:39:46,124][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:39:46,630][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:39:47,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:39:47,650][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:39:48,156][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:39:48,666][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:39:49,170][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:39:49,675][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:39:50,177][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:39:50,681][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:39:51,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:39:51,690][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:39:52,194][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:39:52,699][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:39:53,200][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:39:53,705][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:39:54,208][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:39:54,711][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:39:55,215][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:39:55,720][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:39:56,223][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:39:56,738][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:39:57,244][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:39:57,763][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:39:58,273][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:39:58,796][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:39:59,302][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:39:59,807][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:40:00,313][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:40:00,819][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:40:01,325][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:40:01,832][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:40:02,336][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:40:02,843][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:40:03,348][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:40:03,857][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:40:04,375][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:40:04,879][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:40:05,382][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:40:05,892][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:40:06,400][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:40:06,905][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:40:07,409][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:40:07,912][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:40:08,418][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:40:08,921][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:40:09,424][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:40:09,926][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:40:10,428][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:40:10,942][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:40:11,445][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:40:11,948][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:40:12,455][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:40:12,958][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:40:13,464][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:40:13,972][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:40:14,504][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:40:15,011][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:40:15,518][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:40:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:40:16,528][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:40:17,035][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:40:17,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10869 tokens. [2025-11-13 01:40:18,294][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 01:40:19,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:40:19,064][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:40:19,066][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:40:19,957][__main__][INFO] - Iteration 240 took 51s (29.61% Gen, 68.66% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 36m 37s. Estimated total time: 43h 7m 15s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 14s, 500 more iterations: 7h 11m 12s. [2025-11-13 01:40:19,959][__main__][INFO] - Starting iteration 240. [2025-11-13 01:40:20,428][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 23 and human policies 1. 
[2025-11-13 01:40:20,428][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:40:35,569][__main__][INFO] - Number of regex retries in iteration 240: 0 [2025-11-13 01:40:35,570][__main__][INFO] - agents played in iteration 240 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:40:36,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:40:36,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:40:36,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:40:36,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:40:36,494][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:40:36,494][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:40:37,143][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:40:37,605][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:40:38,116][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:40:38,621][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:40:39,127][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:40:39,632][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:40:40,136][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:40:40,644][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:40:41,171][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:40:41,677][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:40:42,182][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:40:42,687][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:40:43,195][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:40:43,700][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:40:44,206][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:40:44,710][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:40:45,216][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:40:45,724][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:40:46,232][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:40:46,748][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:40:47,255][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:40:47,775][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:40:48,277][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:40:48,784][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:40:49,291][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:40:49,797][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:40:50,301][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:40:50,804][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:40:51,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:40:51,815][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:40:52,320][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:40:52,822][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:40:53,330][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:40:53,839][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:40:54,351][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:40:54,856][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:40:55,363][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:40:55,869][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:40:56,373][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:40:56,881][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:40:57,383][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:40:57,885][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:40:58,392][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:40:58,895][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:40:59,397][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:40:59,900][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:41:00,403][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:41:00,907][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:41:01,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:41:01,917][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:41:02,422][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:41:02,925][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:41:03,429][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:41:03,934][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:41:04,436][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:41:04,968][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:41:05,475][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:41:05,980][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:41:06,485][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:41:06,992][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:41:07,499][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:41:08,004][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:41:08,509][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:41:09,014][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:41:09,520][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10843 tokens. [2025-11-13 01:41:10,243][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.39%, ΔTime: 00:00:33 [2025-11-13 01:41:11,012][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:41:11,014][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:41:11,015][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:41:12,884][__main__][INFO] - Iteration 241 took 52s (28.86% Gen, 67.57% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 11m 19s. Estimated total time: 43h 42m 51s. Time estimates for 10 more iterations: 8m 44s, 100 more iterations: 1h 27m 25s, 500 more iterations: 7h 17m 8s. [2025-11-13 01:41:12,886][__main__][INFO] - Starting iteration 241. [2025-11-13 01:41:13,361][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 01:41:13,362][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:41:17,989][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:41:25,147][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:41:29,986][__main__][INFO] - Number of regex retries in iteration 241: 2 [2025-11-13 01:41:29,986][__main__][INFO] - agents played in iteration 241 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:41:30,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:41:30,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:41:30,877][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:41:30,913][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:41:30,914][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:41:30,915][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:41:31,552][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:41:32,010][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:41:32,520][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:41:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:41:33,533][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:41:34,040][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:41:34,545][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:41:35,078][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:41:35,666][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:41:36,168][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:41:36,678][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:41:37,184][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:41:37,694][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:41:38,199][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:41:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:41:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:41:39,720][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:41:40,225][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:41:40,727][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:41:41,231][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:41:41,738][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:41:42,241][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:41:42,745][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:41:43,248][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:41:43,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:41:44,281][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:41:44,786][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:41:45,292][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:41:45,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:41:46,305][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:41:46,815][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:41:47,322][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:41:47,829][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:41:48,334][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:41:48,837][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:41:49,341][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:41:49,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:41:50,350][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:41:50,853][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:41:51,357][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:41:51,860][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:41:52,364][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:41:52,868][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:41:53,390][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:41:53,897][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:41:54,403][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:41:54,911][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:41:55,416][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:41:55,923][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:41:56,433][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:41:56,937][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:41:57,446][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:41:57,955][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:41:58,462][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:41:58,975][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:41:59,482][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:41:59,995][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:42:00,502][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:42:01,007][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:42:01,512][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:42:02,019][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:42:02,525][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:42:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:42:03,538][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:42:04,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10837 tokens. [2025-11-13 01:42:04,780][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.42%, ΔTime: 00:00:33 [2025-11-13 01:42:05,556][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:42:05,557][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:42:05,559][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:42:06,475][__main__][INFO] - Iteration 242 took 53s (31.30% Gen, 66.97% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 43m 19s. Estimated total time: 44h 15m 43s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 31s, 500 more iterations: 7h 22m 37s. [2025-11-13 01:42:06,477][__main__][INFO] - Starting iteration 242. [2025-11-13 01:42:06,963][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 01:42:06,964][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:42:20,468][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:42:22,958][__main__][INFO] - Number of regex retries in iteration 242: 1 [2025-11-13 01:42:22,959][__main__][INFO] - agents played in iteration 242 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:42:23,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:42:23,830][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:42:23,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:42:23,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:42:23,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:42:23,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:42:24,506][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:42:24,964][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:42:25,476][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:42:25,980][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:42:26,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:42:26,995][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:42:27,503][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:42:28,012][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:42:28,520][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:42:29,031][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:42:29,537][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:42:30,045][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:42:30,552][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:42:31,058][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:42:31,563][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:42:32,077][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:42:32,582][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:42:33,091][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:42:33,594][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:42:34,098][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:42:34,611][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:42:35,116][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:42:35,624][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:42:36,130][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:42:36,635][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:42:37,143][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:42:37,646][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:42:38,149][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:42:38,667][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:42:39,173][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:42:39,690][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:42:40,199][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:42:40,708][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:42:41,221][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:42:41,728][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:42:42,234][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:42:42,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:42:43,247][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:42:43,757][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:42:44,265][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:42:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:42:45,280][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:42:45,791][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:42:46,305][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:42:46,814][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:42:47,320][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:42:47,826][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:42:48,332][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:42:48,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:42:49,346][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:42:49,852][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:42:50,358][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:42:50,867][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:42:51,373][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:42:51,882][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:42:52,387][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:42:52,893][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:42:53,398][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:42:53,901][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:42:54,417][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:42:54,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:42:55,429][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:42:55,932][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:42:56,439][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:42:56,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10862 tokens. [2025-11-13 01:42:57,648][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 01:42:58,420][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:42:58,423][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:42:58,425][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:42:59,340][__main__][INFO] - Iteration 243 took 52s (30.54% Gen, 67.71% Train). Generation: 15s, Training: 35s. Estimated remaining time: 40h 5m 34s. Estimated total time: 43h 38m 52s. Time estimates for 10 more iterations: 8m 43s, 100 more iterations: 1h 27m 17s, 500 more iterations: 7h 16m 28s. [2025-11-13 01:42:59,342][__main__][INFO] - Starting iteration 243. [2025-11-13 01:42:59,832][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 01:42:59,833][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:43:15,589][__main__][INFO] - Number of regex retries in iteration 243: 0 [2025-11-13 01:43:15,590][__main__][INFO] - agents played in iteration 243 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:43:16,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:43:16,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:43:16,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:43:16,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:43:16,516][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:43:16,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:43:17,182][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:43:17,644][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:43:18,156][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:43:18,665][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:43:19,172][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:43:19,690][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:43:20,198][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:43:20,727][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:43:21,231][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:43:21,738][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:43:22,244][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:43:22,750][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:43:23,262][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:43:23,766][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:43:24,271][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:43:24,785][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:43:25,290][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:43:25,795][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:43:26,299][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:43:26,802][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:43:27,322][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:43:27,824][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:43:28,328][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:43:28,831][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:43:29,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:43:29,837][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:43:30,341][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:43:30,844][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:43:31,350][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:43:31,852][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:43:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:43:32,859][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:43:33,368][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:43:33,881][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:43:34,386][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:43:34,889][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:43:35,402][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:43:35,904][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:43:36,424][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:43:36,928][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:43:37,438][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:43:37,946][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:43:38,450][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:43:38,959][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:43:39,468][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:43:39,975][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:43:40,483][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:43:40,989][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:43:41,493][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:43:42,001][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:43:42,505][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:43:43,016][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:43:43,523][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:43:44,024][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:43:44,526][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:43:45,030][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:43:45,531][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:43:46,038][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:43:46,542][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:43:47,048][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:43:47,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:43:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:43:48,560][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:43:49,064][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:43:49,570][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10834 tokens. [2025-11-13 01:43:50,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 01:43:51,040][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:43:51,042][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:43:51,044][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:43:51,954][__main__][INFO] - Iteration 244 took 52s (30.23% Gen, 68.02% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 51m 57s. Estimated total time: 43h 26m 7s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 52s, 500 more iterations: 7h 14m 21s. [2025-11-13 01:43:51,956][__main__][INFO] - Starting iteration 244. [2025-11-13 01:43:52,437][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 01:43:52,438][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:43:56,785][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:44:08,278][__main__][INFO] - Number of regex retries in iteration 244: 1 [2025-11-13 01:44:08,278][__main__][INFO] - agents played in iteration 244 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:44:09,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:44:09,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:44:09,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:44:09,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:44:09,185][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:44:09,186][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:44:09,819][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:44:10,278][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:44:10,784][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:44:11,286][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:44:11,803][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:44:12,309][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:44:12,813][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:44:13,319][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:44:13,824][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:44:14,332][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:44:14,836][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:44:15,340][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:44:15,848][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:44:16,353][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:44:16,858][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:44:17,363][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:44:17,867][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:44:18,373][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:44:18,878][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:44:19,380][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:44:19,882][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:44:20,383][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:44:20,885][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:44:21,386][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:44:21,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:44:22,392][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:44:22,895][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:44:23,399][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:44:23,907][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:44:24,412][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:44:24,931][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:44:25,435][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:44:25,940][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:44:26,443][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:44:26,947][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:44:27,451][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:44:27,954][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:44:28,461][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:44:28,968][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:44:29,477][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:44:29,984][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:44:30,493][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:44:31,000][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:44:31,524][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:44:32,031][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:44:32,542][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:44:33,045][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:44:33,551][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:44:34,063][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:44:34,568][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:44:35,069][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:44:35,576][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:44:36,078][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:44:36,580][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:44:37,083][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:44:37,586][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:44:38,089][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:44:38,591][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:44:39,093][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:44:39,607][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:44:40,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:44:40,625][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:44:41,125][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:44:41,627][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:44:42,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10842 tokens. [2025-11-13 01:44:42,819][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 01:44:43,602][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:44:43,603][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:44:43,605][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:44:44,529][__main__][INFO] - Iteration 245 took 52s (30.41% Gen, 67.82% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 49m 34s. Estimated total time: 43h 24m 37s. Time estimates for 10 more iterations: 8m 40s, 100 more iterations: 1h 26m 49s, 500 more iterations: 7h 14m 6s. [2025-11-13 01:44:44,532][__main__][INFO] - Starting iteration 245. [2025-11-13 01:44:45,005][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 01:44:45,006][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:45:00,093][__main__][INFO] - Number of regex retries in iteration 245: 0 [2025-11-13 01:45:00,094][__main__][INFO] - agents played in iteration 245 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:45:00,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:45:00,946][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:45:00,968][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:45:00,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:45:00,990][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:45:00,990][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:45:01,630][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:45:02,094][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:45:02,603][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:45:03,121][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:45:03,623][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:45:04,122][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:45:04,628][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:45:05,134][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:45:05,642][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:45:06,148][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:45:06,657][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:45:07,164][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:45:07,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:45:08,173][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:45:08,680][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:45:09,185][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:45:09,695][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:45:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:45:10,705][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:45:11,216][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:45:11,722][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:45:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:45:12,738][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:45:13,244][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:45:13,753][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:45:14,257][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:45:14,762][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:45:15,265][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:45:15,768][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:45:16,274][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:45:16,778][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:45:17,280][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:45:17,784][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:45:18,285][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:45:18,786][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:45:19,287][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:45:19,791][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:45:20,304][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:45:20,809][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:45:21,318][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:45:21,824][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:45:22,332][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:45:22,848][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:45:23,355][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:45:23,863][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:45:24,368][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:45:24,874][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:45:25,381][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:45:25,888][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:45:26,394][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:45:26,913][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:45:27,419][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:45:27,923][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:45:28,428][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:45:28,933][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:45:29,438][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:45:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:45:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:45:30,950][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:45:31,458][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:45:31,966][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:45:32,470][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:45:32,974][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:45:33,476][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:45:33,980][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10867 tokens. [2025-11-13 01:45:34,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 01:45:35,436][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:45:35,438][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:45:35,441][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:45:36,356][__main__][INFO] - Iteration 246 took 51s (29.38% Gen, 68.83% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 11m 40s. Estimated total time: 42h 47m 35s. Time estimates for 10 more iterations: 8m 33s, 100 more iterations: 1h 25m 35s, 500 more iterations: 7h 7m 55s. [2025-11-13 01:45:36,359][__main__][INFO] - Starting iteration 246. [2025-11-13 01:45:36,826][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 01:45:36,826][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:45:40,109][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:45:49,167][__main__][INFO] - Number of regex retries in iteration 246: 1 [2025-11-13 01:45:49,168][__main__][INFO] - agents played in iteration 246 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:45:50,023][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:45:50,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:45:50,088][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:45:50,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:45:50,112][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:45:50,113][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:45:50,745][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:45:51,206][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:45:51,709][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:45:52,215][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:45:52,718][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:45:53,221][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:45:53,723][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:45:54,240][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:45:54,742][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:45:55,243][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:45:55,744][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:45:56,245][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:45:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:45:57,252][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:45:57,753][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:45:58,260][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:45:58,764][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:45:59,274][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:45:59,779][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:46:00,285][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:46:00,791][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:46:01,296][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:46:01,801][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:46:02,308][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:46:02,814][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:46:03,319][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:46:03,823][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:46:04,327][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:46:04,845][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:46:05,349][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:46:05,854][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:46:06,357][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:46:06,861][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:46:07,364][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:46:07,865][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:46:08,364][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:46:08,863][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:46:09,362][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:46:09,862][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:46:10,361][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:46:10,865][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:46:11,370][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:46:11,876][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:46:12,384][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:46:12,890][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:46:13,395][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:46:13,899][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:46:14,401][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:46:14,905][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:46:15,411][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:46:15,916][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:46:16,421][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:46:16,927][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:46:17,433][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:46:17,949][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:46:18,456][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:46:18,965][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:46:19,469][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:46:19,975][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:46:20,483][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:46:20,989][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:46:21,495][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:46:22,002][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:46:22,509][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:46:23,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10866 tokens. [2025-11-13 01:46:23,745][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 01:46:24,523][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:46:24,525][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:46:24,527][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:46:25,425][__main__][INFO] - Iteration 247 took 48s (25.39% Gen, 72.76% Train). Generation: 12s, Training: 35s. Estimated remaining time: 36h 53m 16s. Estimated total time: 40h 30m 0s. Time estimates for 10 more iterations: 8m 6s, 100 more iterations: 1h 21m 0s, 500 more iterations: 6h 45m 0s. [2025-11-13 01:46:25,427][__main__][INFO] - Starting iteration 247. [2025-11-13 01:46:25,895][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 01:46:25,896][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:46:40,912][__main__][INFO] - Number of regex retries in iteration 247: 0 [2025-11-13 01:46:40,913][__main__][INFO] - agents played in iteration 247 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:46:41,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:46:41,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:46:41,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:46:41,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:46:41,753][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:46:41,755][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:46:42,387][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:46:42,850][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:46:43,358][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:46:43,860][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:46:44,368][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:46:44,869][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:46:45,371][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:46:45,876][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:46:46,378][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:46:46,882][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:46:47,384][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:46:47,884][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:46:48,387][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:46:48,888][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:46:49,389][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:46:49,887][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:46:50,388][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:46:50,891][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:46:51,394][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:46:51,897][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:46:52,399][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:46:52,900][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:46:53,407][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:46:53,912][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:46:54,416][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:46:54,937][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:46:55,442][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:46:55,942][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:46:56,447][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:46:56,950][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:46:57,454][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:46:57,958][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:46:58,462][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:46:58,968][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:46:59,474][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:46:59,982][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:47:00,486][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:47:00,989][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:47:01,490][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:47:01,990][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:47:02,493][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:47:03,001][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:47:03,505][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:47:04,009][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:47:04,513][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:47:05,018][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:47:05,523][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:47:06,027][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:47:06,529][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:47:07,038][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:47:07,543][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:47:08,062][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:47:08,567][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:47:09,074][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:47:09,580][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:47:10,087][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:47:10,596][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:47:11,104][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:47:11,606][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:47:12,118][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:47:12,620][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:47:13,132][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:47:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:47:14,142][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:47:14,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10854 tokens. [2025-11-13 01:47:15,441][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 01:47:16,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:47:16,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:47:16,227][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:47:17,123][__main__][INFO] - Iteration 248 took 51s (29.31% Gen, 68.93% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 3m 50s. Estimated total time: 42h 41m 26s. Time estimates for 10 more iterations: 8m 32s, 100 more iterations: 1h 25m 22s, 500 more iterations: 7h 6m 54s. [2025-11-13 01:47:17,125][__main__][INFO] - Starting iteration 248. [2025-11-13 01:47:17,613][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 01:47:17,613][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:47:31,849][__main__][INFO] - Number of regex retries in iteration 248: 0 [2025-11-13 01:47:31,850][__main__][INFO] - agents played in iteration 248 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:47:32,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:47:32,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:47:32,714][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:47:32,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:47:32,739][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:47:32,741][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:47:33,393][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:47:33,852][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:47:34,373][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:47:34,877][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:47:35,379][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:47:35,882][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:47:36,383][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:47:36,885][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:47:37,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:47:37,887][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:47:38,392][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:47:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:47:39,398][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:47:39,899][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:47:40,401][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:47:40,903][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:47:41,405][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:47:41,905][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:47:42,408][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:47:42,908][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:47:43,409][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:47:43,909][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:47:44,409][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:47:44,913][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:47:45,413][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:47:45,912][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:47:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:47:46,914][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:47:47,416][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:47:47,916][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:47:48,417][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:47:48,920][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:47:49,423][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:47:49,932][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:47:50,463][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:47:50,967][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:47:51,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:47:51,989][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:47:52,494][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:47:52,997][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:47:53,504][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:47:54,017][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:47:54,522][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:47:55,030][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:47:55,548][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:47:56,054][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:47:56,570][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:47:57,074][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:47:57,581][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:47:58,092][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:47:58,598][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:47:59,104][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:47:59,612][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:48:00,115][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:48:00,625][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:48:01,128][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:48:01,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:48:02,139][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:48:02,643][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:48:03,149][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:48:03,661][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:48:04,168][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:48:04,682][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:48:05,187][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:48:05,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10813 tokens. [2025-11-13 01:48:06,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 01:48:07,211][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:48:07,212][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:48:07,214][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:48:08,168][__main__][INFO] - Iteration 249 took 50s (28.16% Gen, 69.95% Train). Generation: 14s, Training: 35s. Estimated remaining time: 38h 29m 21s. Estimated total time: 42h 7m 48s. Time estimates for 10 more iterations: 8m 25s, 100 more iterations: 1h 24m 15s, 500 more iterations: 7h 1m 18s. [2025-11-13 01:48:08,171][__main__][INFO] - Starting iteration 249. [2025-11-13 01:48:08,663][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 01:48:08,664][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:48:14,114][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:48:26,626][__main__][INFO] - Number of regex retries in iteration 249: 1 [2025-11-13 01:48:26,627][__main__][INFO] - agents played in iteration 249 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:48:27,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:48:27,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:48:27,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:48:27,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:48:27,595][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:48:27,596][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:48:28,246][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:48:28,719][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:48:29,226][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:48:29,731][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:48:30,233][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:48:30,737][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:48:31,248][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:48:31,752][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:48:32,253][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:48:32,762][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:48:33,266][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:48:33,775][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:48:34,277][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:48:34,777][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:48:35,289][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:48:35,793][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:48:36,295][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:48:36,799][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:48:37,303][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:48:37,806][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:48:38,308][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:48:38,809][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:48:39,312][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:48:39,813][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:48:40,318][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:48:40,819][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:48:41,321][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:48:41,825][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:48:42,326][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:48:42,826][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:48:43,333][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:48:43,839][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:48:44,349][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:48:44,860][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:48:45,382][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:48:45,894][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:48:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:48:46,906][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:48:47,414][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:48:47,924][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:48:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:48:48,941][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:48:49,447][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:48:49,954][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:48:50,461][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:48:50,969][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:48:51,474][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:48:51,980][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:48:52,492][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:48:52,999][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:48:53,504][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:48:54,010][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:48:54,515][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:48:55,020][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:48:55,521][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:48:56,027][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:48:56,532][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:48:57,037][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:48:57,544][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:48:58,049][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:48:58,553][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:48:59,082][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:48:59,588][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:49:00,095][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:49:00,606][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10808 tokens. [2025-11-13 01:49:01,379][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 01:49:02,161][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:49:02,163][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:49:02,164][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:49:03,084][__main__][INFO] - Iteration 250 took 54s (33.01% Gen, 65.30% Train). Generation: 17s, Training: 35s. Estimated remaining time: 41h 41m 41s. Estimated total time: 45h 21m 2s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 42s, 500 more iterations: 7h 33m 30s. [2025-11-13 01:49:03,086][__main__][INFO] - Starting iteration 250. [2025-11-13 01:49:03,569][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 24 and human policies 1. 
[2025-11-13 01:49:03,569][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:49:08,217][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:49:16,612][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:49:20,214][__main__][INFO] - Number of regex retries in iteration 250: 2 [2025-11-13 01:49:20,215][__main__][INFO] - agents played in iteration 250 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:49:21,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:49:21,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:49:21,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:49:21,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:49:21,190][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:49:21,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:49:21,884][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:49:22,341][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:49:22,864][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:49:23,370][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:49:23,872][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:49:24,374][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:49:24,877][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:49:25,379][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:49:25,880][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:49:26,386][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:49:26,888][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:49:27,391][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:49:27,895][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:49:28,393][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:49:28,898][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:49:29,407][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:49:29,913][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:49:30,416][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:49:30,918][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:49:31,422][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:49:31,926][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:49:32,426][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:49:32,928][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:49:33,431][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:49:33,933][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:49:34,436][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:49:34,938][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:49:35,437][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:49:35,941][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:49:36,443][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:49:36,948][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:49:37,464][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:49:37,969][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:49:38,483][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:49:38,991][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:49:39,494][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:49:40,006][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:49:40,511][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:49:41,015][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:49:41,522][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:49:42,029][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:49:42,558][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:49:43,068][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:49:43,578][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:49:44,085][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:49:44,593][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:49:45,104][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:49:45,612][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:49:46,119][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:49:46,622][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:49:47,126][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:49:47,632][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:49:48,137][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:49:48,641][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:49:49,155][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:49:49,660][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:49:50,171][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:49:50,676][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:49:51,180][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:49:51,686][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:49:52,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:49:52,692][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:49:53,198][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:49:53,701][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:49:54,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10837 tokens. [2025-11-13 01:49:54,992][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 01:49:55,795][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:49:55,797][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:49:55,799][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:49:57,625][__main__][INFO] - Iteration 251 took 54s (30.79% Gen, 65.83% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 22m 36s. Estimated total time: 45h 2m 52s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 5s, 500 more iterations: 7h 30m 28s. [2025-11-13 01:49:57,628][__main__][INFO] - Starting iteration 251. [2025-11-13 01:49:58,119][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. [2025-11-13 01:49:58,120][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:50:08,601][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls OffsetTable is a crucial concept in computer memory management. 
Could you explain its role in the context of function parameters and return addresses in a stack frame? How does it differ when dealing with different calling conventions like cdecl and stdcall? Also, could you provide an example of how the Offset Table might look in a simple function call scenario? did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:50:15,677][__main__][INFO] - Number of regex retries in iteration 251: 1 [2025-11-13 01:50:15,678][__main__][INFO] - agents played in iteration 251 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:50:16,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:50:16,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:50:16,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:50:16,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:50:16,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:50:16,545][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:50:17,183][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:50:17,637][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:50:18,147][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:50:18,645][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:50:19,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:50:19,647][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:50:20,146][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:50:20,645][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:50:21,145][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:50:21,649][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:50:22,149][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:50:22,653][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:50:23,158][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:50:23,661][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:50:24,163][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:50:24,665][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:50:25,168][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:50:25,671][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:50:26,178][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:50:26,684][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:50:27,197][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:50:27,699][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:50:28,212][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:50:28,713][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:50:29,216][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:50:29,720][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:50:30,223][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:50:30,729][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:50:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:50:31,743][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:50:32,249][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:50:32,756][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:50:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:50:33,764][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:50:34,270][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:50:34,776][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:50:35,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:50:35,788][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:50:36,292][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:50:36,794][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:50:37,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:50:37,803][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:50:38,302][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:50:38,803][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:50:39,305][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:50:39,807][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:50:40,308][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:50:40,810][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:50:41,311][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:50:41,825][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:50:42,328][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:50:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:50:43,334][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:50:43,836][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:50:44,338][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:50:44,842][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:50:45,347][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:50:45,850][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:50:46,355][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:50:46,861][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:50:47,366][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:50:47,868][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:50:48,375][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:50:48,879][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:50:49,384][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10774 tokens. [2025-11-13 01:50:50,119][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:32 [2025-11-13 01:50:50,873][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:50:50,874][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:50:50,876][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:50:51,735][__main__][INFO] - Iteration 252 took 53s (32.75% Gen, 65.65% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 59m 37s. Estimated total time: 44h 40m 47s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 21s, 500 more iterations: 7h 26m 47s. [2025-11-13 01:50:51,737][__main__][INFO] - Starting iteration 252. [2025-11-13 01:50:52,218][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 01:50:52,218][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:50:56,108][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:51:07,552][__main__][INFO] - Number of regex retries in iteration 252: 1 [2025-11-13 01:51:07,552][__main__][INFO] - agents played in iteration 252 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:51:08,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:51:08,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:51:08,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:51:08,487][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:51:08,488][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:51:08,488][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:51:09,129][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:51:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:51:10,096][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:51:10,619][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:51:11,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:51:11,627][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:51:12,126][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:51:12,626][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:51:13,129][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:51:13,634][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:51:14,138][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:51:14,644][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:51:15,147][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:51:15,649][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:51:16,154][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:51:16,658][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:51:17,162][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:51:17,664][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:51:18,165][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:51:18,667][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:51:19,169][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:51:19,671][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:51:20,173][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:51:20,676][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:51:21,177][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:51:21,682][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:51:22,185][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:51:22,702][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:51:23,205][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:51:23,720][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:51:24,224][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:51:24,727][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:51:25,242][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:51:25,747][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:51:26,253][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:51:26,759][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:51:27,266][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:51:27,772][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:51:28,278][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:51:28,783][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:51:29,296][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:51:29,804][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:51:30,321][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:51:30,827][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:51:31,332][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:51:31,846][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:51:32,349][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:51:32,854][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:51:33,359][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:51:33,866][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:51:34,373][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:51:34,876][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:51:35,381][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:51:35,892][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:51:36,397][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:51:36,901][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:51:37,405][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:51:37,909][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:51:38,429][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:51:38,936][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:51:39,443][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:51:39,946][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:51:40,450][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:51:40,956][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:51:41,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10831 tokens. [2025-11-13 01:51:42,233][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 01:51:43,008][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:51:43,010][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:51:43,013][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:51:43,926][__main__][INFO] - Iteration 253 took 51s (29.65% Gen, 68.58% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 23m 24s. Estimated total time: 43h 5m 26s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 10s, 500 more iterations: 7h 10m 54s. [2025-11-13 01:51:43,928][__main__][INFO] - Starting iteration 253. [2025-11-13 01:51:44,403][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 01:51:44,404][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:51:50,004][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:52:00,513][__main__][INFO] - Number of regex retries in iteration 253: 1 [2025-11-13 01:52:00,513][__main__][INFO] - agents played in iteration 253 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:52:01,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:52:01,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:52:01,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:52:01,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:52:01,499][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:52:01,499][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:52:02,215][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:52:02,675][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:52:03,183][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:52:03,688][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:52:04,191][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:52:04,695][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:52:05,197][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:52:05,701][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:52:06,205][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:52:06,705][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:52:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:52:07,718][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:52:08,220][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:52:08,722][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:52:09,224][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:52:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:52:10,226][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:52:10,727][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:52:11,241][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:52:11,744][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:52:12,245][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:52:12,748][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:52:13,250][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:52:13,754][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:52:14,256][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:52:14,759][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:52:15,267][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:52:15,772][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:52:16,279][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:52:16,785][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:52:17,288][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:52:17,793][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:52:18,299][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:52:18,804][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:52:19,320][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:52:19,825][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:52:20,336][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:52:20,841][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:52:21,347][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:52:21,853][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:52:22,360][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:52:22,871][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:52:23,379][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:52:23,885][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:52:24,391][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:52:24,895][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:52:25,399][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:52:25,911][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:52:26,418][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:52:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:52:27,435][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:52:27,939][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:52:28,444][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:52:28,947][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:52:29,451][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:52:29,953][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:52:30,458][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:52:30,970][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:52:31,478][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:52:31,982][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:52:32,486][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:52:32,994][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:52:33,513][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:52:34,021][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:52:34,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10831 tokens. [2025-11-13 01:52:35,267][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 01:52:36,031][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:52:36,033][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:52:36,035][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:52:36,932][__main__][INFO] - Iteration 254 took 52s (30.67% Gen, 67.62% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 3m 32s. Estimated total time: 43h 46m 27s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 32s, 500 more iterations: 7h 17m 44s. [2025-11-13 01:52:36,934][__main__][INFO] - Starting iteration 254. [2025-11-13 01:52:37,401][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 01:52:37,402][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:52:45,522][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:52:51,745][__main__][INFO] - Number of regex retries in iteration 254: 1 [2025-11-13 01:52:51,746][__main__][INFO] - agents played in iteration 254 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:52:52,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:52:52,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:52:52,577][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:52:52,600][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:52:52,601][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:52:52,602][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:52:53,261][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:52:53,720][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:52:54,232][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:52:54,751][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:52:55,256][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:52:55,761][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:52:56,264][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:52:56,769][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:52:57,278][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:52:57,788][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:52:58,293][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:52:58,803][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:52:59,306][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:52:59,810][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:53:00,314][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:53:00,818][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:53:01,336][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:53:01,839][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:53:02,341][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:53:02,844][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:53:03,345][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:53:03,853][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:53:04,355][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:53:04,858][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:53:05,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:53:05,865][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:53:06,369][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:53:06,878][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:53:07,385][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:53:07,893][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:53:08,404][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:53:08,912][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:53:09,436][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:53:09,943][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:53:10,450][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:53:10,956][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:53:11,462][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:53:11,971][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:53:12,479][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:53:12,987][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:53:13,496][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:53:14,002][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:53:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:53:15,014][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:53:15,520][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:53:16,043][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:53:16,548][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:53:17,057][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:53:17,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:53:18,065][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:53:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:53:19,081][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:53:19,586][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:53:20,092][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:53:20,600][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:53:21,106][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:53:21,611][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:53:22,115][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:53:22,632][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:53:23,137][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:53:23,652][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:53:24,157][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:53:24,662][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:53:25,172][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:53:25,678][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10867 tokens. [2025-11-13 01:53:26,389][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 01:53:27,134][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:53:27,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:53:27,138][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:53:28,050][__main__][INFO] - Iteration 255 took 50s (28.32% Gen, 69.88% Train). Generation: 14s, Training: 35s. Estimated remaining time: 38h 28m 41s. Estimated total time: 42h 12m 28s. Time estimates for 10 more iterations: 8m 26s, 100 more iterations: 1h 24m 24s, 500 more iterations: 7h 2m 4s. [2025-11-13 01:53:28,053][__main__][INFO] - Starting iteration 255. [2025-11-13 01:53:28,525][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 01:53:28,525][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:53:35,036][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:53:43,976][__main__][INFO] - Number of regex retries in iteration 255: 1 [2025-11-13 01:53:43,977][__main__][INFO] - agents played in iteration 255 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:53:44,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:53:44,797][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:53:44,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:53:44,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:53:44,844][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:53:44,845][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:53:45,509][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:53:45,972][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:53:46,481][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:53:46,986][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:53:47,496][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:53:48,002][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:53:48,506][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:53:49,014][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:53:49,520][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:53:50,025][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:53:50,531][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:53:51,037][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:53:51,558][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:53:52,062][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:53:52,566][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:53:53,067][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:53:53,573][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:53:54,079][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:53:54,583][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:53:55,085][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:53:55,588][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:53:56,093][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:53:56,597][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:53:57,100][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:53:57,603][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:53:58,107][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:53:58,610][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:53:59,113][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:53:59,617][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:54:00,122][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:54:00,626][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:54:01,133][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:54:01,641][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:54:02,161][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:54:02,668][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:54:03,175][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:54:03,681][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:54:04,189][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:54:04,695][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:54:05,202][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:54:05,708][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:54:06,213][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:54:06,721][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:54:07,230][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:54:07,736][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:54:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:54:08,749][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:54:09,254][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:54:09,761][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:54:10,265][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:54:10,770][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:54:11,274][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:54:11,781][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:54:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:54:12,793][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:54:13,300][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:54:13,807][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:54:14,316][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:54:14,820][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:54:15,325][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:54:15,828][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:54:16,331][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:54:16,835][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:54:17,340][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:54:17,862][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10837 tokens. [2025-11-13 01:54:18,556][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 01:54:19,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:54:19,318][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:54:19,320][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:54:20,239][__main__][INFO] - Iteration 256 took 51s (29.88% Gen, 68.34% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 21m 5s. Estimated total time: 43h 5m 44s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 11s, 500 more iterations: 7h 10m 57s. [2025-11-13 01:54:20,241][__main__][INFO] - Starting iteration 256. [2025-11-13 01:54:20,727][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 01:54:20,727][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:54:25,968][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:54:38,165][__main__][INFO] - Number of regex retries in iteration 256: 1 [2025-11-13 01:54:38,165][__main__][INFO] - agents played in iteration 256 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:54:38,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:54:38,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:54:39,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:54:39,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:54:39,026][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:54:39,026][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:54:39,681][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:54:40,155][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:54:40,663][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:54:41,169][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:54:41,673][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:54:42,177][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:54:42,685][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:54:43,188][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:54:43,691][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:54:44,201][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:54:44,706][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:54:45,209][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:54:45,707][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:54:46,210][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:54:46,713][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:54:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:54:47,714][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:54:48,218][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:54:48,719][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:54:49,221][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:54:49,729][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:54:50,235][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:54:50,751][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:54:51,254][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:54:51,756][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:54:52,261][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:54:52,763][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:54:53,272][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:54:53,778][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:54:54,283][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:54:54,789][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:54:55,292][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:54:55,797][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:54:56,301][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:54:56,803][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:54:57,308][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:54:57,813][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:54:58,322][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:54:58,834][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:54:59,339][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:54:59,851][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:55:00,357][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:55:00,863][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:55:01,373][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:55:01,879][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:55:02,386][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:55:02,891][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:55:03,397][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:55:03,903][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:55:04,413][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:55:04,920][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:55:05,427][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:55:05,932][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:55:06,443][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:55:06,947][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:55:07,453][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:55:07,959][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:55:08,465][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:55:08,975][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:55:09,479][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:55:09,983][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:55:10,488][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:55:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:55:11,496][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:55:11,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10823 tokens. [2025-11-13 01:55:12,699][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 01:55:13,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:55:13,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:55:13,451][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:55:14,365][__main__][INFO] - Iteration 257 took 53s (32.51% Gen, 65.79% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 56m 23s. Estimated total time: 44h 41m 56s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 23s, 500 more iterations: 7h 26m 59s. [2025-11-13 01:55:14,367][__main__][INFO] - Starting iteration 257. [2025-11-13 01:55:14,852][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 01:55:14,852][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:55:30,806][__main__][INFO] - Number of regex retries in iteration 257: 0 [2025-11-13 01:55:30,807][__main__][INFO] - agents played in iteration 257 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:55:31,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:55:31,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:55:31,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:55:31,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:55:31,713][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:55:31,714][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:55:32,365][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:55:32,825][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:55:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:55:33,837][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:55:34,340][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:55:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:55:35,347][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:55:35,849][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:55:36,351][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:55:36,860][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:55:37,368][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:55:37,875][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:55:38,383][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:55:38,885][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:55:39,388][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:55:39,890][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:55:40,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:55:40,904][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:55:41,407][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:55:41,910][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:55:42,412][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:55:42,916][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:55:43,422][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:55:43,925][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:55:44,427][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:55:44,928][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:55:45,429][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:55:45,930][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:55:46,434][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:55:46,935][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:55:47,438][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:55:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:55:48,448][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:55:48,960][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:55:49,469][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:55:49,981][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:55:50,487][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:55:50,992][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:55:51,502][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:55:52,007][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:55:52,514][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:55:53,021][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:55:53,529][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:55:54,037][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:55:54,542][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:55:55,049][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:55:55,558][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:55:56,064][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:55:56,571][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:55:57,085][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:55:57,588][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:55:58,107][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:55:58,613][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:55:59,121][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:55:59,625][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:56:00,130][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:56:00,635][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:56:01,138][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:56:01,642][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:56:02,147][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:56:02,649][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:56:03,153][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:56:03,658][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:56:04,163][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:56:04,668][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10829 tokens. [2025-11-13 01:56:05,392][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 01:56:06,143][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:56:06,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:56:06,149][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:56:07,080][__main__][INFO] - Iteration 258 took 52s (30.55% Gen, 67.67% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 45m 0s. Estimated total time: 43h 31m 26s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 14s. [2025-11-13 01:56:07,082][__main__][INFO] - Starting iteration 258. [2025-11-13 01:56:07,557][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 01:56:07,558][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:56:11,892][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:56:23,083][__main__][INFO] - Number of regex retries in iteration 258: 1 [2025-11-13 01:56:23,084][__main__][INFO] - agents played in iteration 258 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:56:23,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:56:23,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:56:23,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:56:23,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:56:23,942][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:56:23,943][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:56:24,595][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:56:25,058][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:56:25,567][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:56:26,069][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:56:26,574][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:56:27,098][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:56:27,603][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:56:28,106][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:56:28,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:56:29,123][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:56:29,629][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:56:30,133][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:56:30,637][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:56:31,142][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:56:31,650][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:56:32,155][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:56:32,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:56:33,167][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:56:33,690][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:56:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:56:34,693][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:56:35,198][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:56:35,702][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:56:36,205][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:56:36,707][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:56:37,209][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:56:37,718][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:56:38,221][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:56:38,723][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:56:39,225][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:56:39,726][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:56:40,228][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:56:40,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:56:41,240][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:56:41,746][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:56:42,251][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:56:42,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:56:43,265][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:56:43,771][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:56:44,275][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:56:44,781][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:56:45,285][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:56:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:56:46,301][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:56:46,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:56:47,319][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:56:47,824][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:56:48,342][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:56:48,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:56:49,359][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:56:49,866][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:56:50,371][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:56:50,888][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:56:51,393][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:56:51,905][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:56:52,410][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:56:52,917][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:56:53,425][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:56:53,928][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:56:54,434][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:56:54,939][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:56:55,443][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:56:55,949][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:56:56,452][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:56:56,960][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10859 tokens. [2025-11-13 01:56:57,655][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 01:56:58,414][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:56:58,415][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:56:58,417][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:56:59,325][__main__][INFO] - Iteration 259 took 51s (29.99% Gen, 68.25% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 21m 6s. Estimated total time: 43h 8m 23s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 23s. [2025-11-13 01:56:59,327][__main__][INFO] - Starting iteration 259. [2025-11-13 01:56:59,803][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 01:56:59,804][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:57:04,395][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:57:05,220][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:57:08,686][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:57:15,375][__main__][INFO] - Number of regex retries in iteration 259: 3 [2025-11-13 01:57:15,375][__main__][INFO] - agents played in iteration 259 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:57:16,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:57:16,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:57:16,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:57:16,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:57:16,235][mllm.training.trainer_ad_align][INFO] - 
Sharing advantage alignment data. [2025-11-13 01:57:16,236][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 01:57:16,878][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:57:17,338][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:57:17,851][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:57:18,354][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:57:18,862][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:57:19,366][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:57:19,875][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:57:20,378][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:57:20,882][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:57:21,404][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:57:21,909][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:57:22,413][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:57:22,915][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:57:23,416][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:57:23,921][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:57:24,423][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:57:24,928][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:57:25,436][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:57:25,939][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:57:26,441][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-13 01:57:26,944][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 01:57:27,445][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:57:27,945][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:57:28,447][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:57:28,949][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:57:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:57:29,963][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:57:30,465][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:57:30,979][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:57:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:57:32,000][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:57:32,505][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:57:33,010][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:57:33,512][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:57:34,018][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:57:34,526][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:57:35,028][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:57:35,531][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:57:36,038][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:57:36,544][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:57:37,049][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-13 01:57:37,554][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 01:57:38,062][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:57:38,576][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:57:39,081][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:57:39,587][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:57:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:57:40,603][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:57:41,114][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:57:41,618][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:57:42,121][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:57:42,627][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:57:43,131][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:57:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:57:44,141][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:57:44,646][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:57:45,152][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:57:45,655][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:57:46,159][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:57:46,663][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:57:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:57:47,674][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-13 01:57:48,179][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 01:57:48,682][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:57:49,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10827 tokens. [2025-11-13 01:57:49,910][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 01:57:50,656][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:57:50,658][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:57:50,660][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:57:51,563][__main__][INFO] - Iteration 260 took 51s (30.08% Gen, 68.17% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 19m 51s. Estimated total time: 43h 8m 1s. Time estimates for 10 more iterations: 8m 37s, 100 more iterations: 1h 26m 16s, 500 more iterations: 7h 11m 20s. [2025-11-13 01:57:51,565][__main__][INFO] - Starting iteration 260. [2025-11-13 01:57:52,031][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 25 and human policies 1. 
[2025-11-13 01:57:52,032][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:58:08,605][__main__][INFO] - Number of regex retries in iteration 260: 0 [2025-11-13 01:58:08,606][__main__][INFO] - agents played in iteration 260 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:58:09,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:58:09,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:58:09,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:58:09,463][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:58:09,463][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:58:09,464][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:58:10,112][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:58:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:58:11,080][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:58:11,598][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:58:12,097][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:58:12,601][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:58:13,107][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:58:13,611][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:58:14,116][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:58:14,618][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:58:15,121][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:58:15,628][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:58:16,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:58:16,634][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:58:17,139][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:58:17,644][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:58:18,146][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:58:18,651][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:58:19,155][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:58:19,658][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:58:20,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:58:20,662][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:58:21,164][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:58:21,666][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:58:22,169][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:58:22,671][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:58:23,172][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:58:23,686][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:58:24,191][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:58:24,704][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:58:25,209][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:58:25,713][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:58:26,214][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:58:26,720][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:58:27,229][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:58:27,732][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:58:28,238][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:58:28,745][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:58:29,250][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:58:29,755][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:58:30,263][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:58:30,767][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:58:31,271][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:58:31,774][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:58:32,279][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:58:32,787][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:58:33,291][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:58:33,805][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:58:34,313][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:58:34,815][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:58:35,326][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:58:35,831][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:58:36,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:58:36,842][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:58:37,346][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:58:37,852][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:58:38,356][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:58:38,867][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:58:39,386][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:58:39,891][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:58:40,408][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:58:40,914][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:58:41,419][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:58:41,926][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:58:42,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10790 tokens. [2025-11-13 01:58:43,169][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 01:58:43,937][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:58:43,938][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:58:43,940][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:58:45,771][__main__][INFO] - Iteration 261 took 53s (30.84% Gen, 65.75% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 57m 54s. Estimated total time: 44h 46m 58s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 33s, 500 more iterations: 7h 27m 49s. [2025-11-13 01:58:45,773][__main__][INFO] - Starting iteration 261. [2025-11-13 01:58:46,256][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 01:58:46,256][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:58:50,765][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:58:51,768][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:59:01,552][__main__][INFO] - Number of regex retries in iteration 261: 2 [2025-11-13 01:59:01,552][__main__][INFO] - agents played in iteration 261 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:59:02,384][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:59:02,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:59:02,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:59:02,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:59:02,456][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:59:02,457][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:59:03,136][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:59:03,599][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:59:04,109][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 01:59:04,613][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 01:59:05,116][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 01:59:05,621][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 01:59:06,126][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 01:59:06,629][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 01:59:07,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 01:59:07,638][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 01:59:08,142][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 01:59:08,646][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 01:59:09,149][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 01:59:09,657][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 01:59:10,164][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 01:59:10,668][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 01:59:11,180][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 01:59:11,684][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 01:59:12,206][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 01:59:12,711][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 01:59:13,221][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
01:59:13,725][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 01:59:14,226][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 01:59:14,732][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 01:59:15,235][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 01:59:15,741][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 01:59:16,245][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 01:59:16,748][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 01:59:17,256][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 01:59:17,762][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 01:59:18,268][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 01:59:18,778][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 01:59:19,284][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 01:59:19,787][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 01:59:20,298][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 01:59:20,805][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 01:59:21,324][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 01:59:21,830][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 01:59:22,336][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 01:59:22,841][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 01:59:23,346][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 01:59:23,850][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
01:59:24,351][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 01:59:24,852][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 01:59:25,357][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 01:59:25,858][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 01:59:26,363][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 01:59:26,868][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 01:59:27,372][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 01:59:27,881][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 01:59:28,387][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 01:59:28,892][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 01:59:29,403][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 01:59:29,908][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 01:59:30,417][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 01:59:30,923][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 01:59:31,428][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 01:59:31,931][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 01:59:32,438][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 01:59:32,942][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 01:59:33,446][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 01:59:33,952][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 01:59:34,458][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
01:59:34,961][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 01:59:35,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10873 tokens. [2025-11-13 01:59:36,193][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 01:59:36,943][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 01:59:36,945][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 01:59:36,946][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 01:59:37,832][__main__][INFO] - Iteration 262 took 51s (29.66% Gen, 68.62% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 8m 54s. Estimated total time: 42h 58m 50s. Time estimates for 10 more iterations: 8m 35s, 100 more iterations: 1h 25m 57s, 500 more iterations: 7h 9m 48s. [2025-11-13 01:59:37,834][__main__][INFO] - Starting iteration 262. [2025-11-13 01:59:38,316][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 01:59:38,317][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 01:59:48,932][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 01:59:57,386][__main__][INFO] - Number of regex retries in iteration 262: 1 [2025-11-13 01:59:57,387][__main__][INFO] - agents played in iteration 262 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 01:59:58,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:59:58,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:59:58,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:59:58,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 01:59:58,307][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 01:59:58,308][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 01:59:58,960][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 01:59:59,425][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 01:59:59,934][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:00:00,442][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:00:00,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:00:01,454][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:00:01,961][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:00:02,466][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:00:02,973][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:00:03,479][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:00:03,984][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:00:04,489][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:00:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:00:05,507][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:00:06,011][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:00:06,521][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:00:07,029][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:00:07,541][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:00:08,048][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:00:08,557][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:00:09,060][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:00:09,567][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:00:10,074][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:00:10,582][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:00:11,087][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:00:11,602][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:00:12,111][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:00:12,622][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:00:13,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:00:13,634][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:00:14,143][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:00:14,649][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:00:15,152][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:00:15,653][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:00:16,157][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:00:16,661][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:00:17,166][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:00:17,669][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:00:18,181][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:00:18,686][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:00:19,190][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:00:19,696][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:00:20,201][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:00:20,711][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:00:21,221][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:00:21,730][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:00:22,241][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:00:22,746][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:00:23,255][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:00:23,763][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:00:24,268][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:00:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:00:25,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:00:25,786][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:00:26,292][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:00:26,796][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:00:27,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:00:27,806][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:00:28,308][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:00:28,818][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:00:29,319][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:00:29,821][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:00:30,326][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:00:30,829][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:00:31,347][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10835 tokens. [2025-11-13 02:00:32,023][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.35%, ΔTime: 00:00:33 [2025-11-13 02:00:32,781][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:00:32,782][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:00:32,784][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:00:33,727][__main__][INFO] - Iteration 263 took 55s (34.41% Gen, 63.88% Train). Generation: 19s, Training: 35s. Estimated remaining time: 42h 19m 42s. Estimated total time: 46h 10m 35s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 45s. [2025-11-13 02:00:33,729][__main__][INFO] - Starting iteration 263. [2025-11-13 02:00:34,201][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 02:00:34,202][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:00:38,347][mllm.models.large_language_model_local][WARNING] - Response Proposal: 5 hats, 5 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:00:38,587][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:00:49,892][__main__][INFO] - Number of regex retries in iteration 263: 2 [2025-11-13 02:00:49,893][__main__][INFO] - agents played in iteration 263 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:00:50,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:00:50,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:00:50,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:00:50,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:00:50,809][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:00:50,809][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:00:51,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:00:51,926][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:00:52,435][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:00:52,941][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:00:53,445][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:00:53,949][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:00:54,453][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:00:54,961][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:00:55,464][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:00:55,985][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:00:56,488][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:00:56,994][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:00:57,497][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:00:58,001][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:00:58,504][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:00:59,032][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:00:59,543][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:01:00,048][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:01:00,549][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:01:01,055][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:01:01,563][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:01:02,068][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:01:02,575][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:01:03,082][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:01:03,587][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:01:04,094][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:01:04,599][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:01:05,105][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:01:05,622][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:01:06,128][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:01:06,640][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:01:07,144][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:01:07,648][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:01:08,159][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:01:08,664][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:01:09,167][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:01:09,672][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:01:10,180][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:01:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:01:11,189][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:01:11,696][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:01:12,202][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:01:12,708][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:01:13,218][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:01:13,730][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:01:14,237][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:01:14,750][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:01:15,253][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:01:15,764][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:01:16,277][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:01:16,784][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:01:17,290][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:01:17,793][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:01:18,299][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:01:18,816][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:01:19,325][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:01:19,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:01:20,340][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:01:20,843][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:01:21,357][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:01:21,858][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:01:22,361][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:01:22,862][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:01:23,363][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:01:23,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10815 tokens. [2025-11-13 02:01:24,601][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 02:01:25,389][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:01:25,390][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:01:25,392][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:01:26,317][__main__][INFO] - Iteration 264 took 52s (30.11% Gen, 68.11% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 34m 5s. Estimated total time: 43h 25m 50s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 51s, 500 more iterations: 7h 14m 18s. [2025-11-13 02:01:26,320][__main__][INFO] - Starting iteration 264. [2025-11-13 02:01:26,805][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 02:01:26,806][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:01:43,066][__main__][INFO] - Number of regex retries in iteration 264: 0 [2025-11-13 02:01:43,066][__main__][INFO] - agents played in iteration 264 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:01:43,855][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:01:43,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:01:43,907][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:01:43,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:01:43,929][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:01:43,931][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:01:44,578][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:01:45,042][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:01:45,550][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:01:46,052][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:01:46,563][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:01:47,068][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:01:47,575][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:01:48,080][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:01:48,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:01:49,093][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:01:49,595][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:01:50,100][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:01:50,603][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:01:51,106][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:01:51,610][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:01:52,112][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:01:52,617][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:01:53,121][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:01:53,627][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:01:54,148][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:01:54,653][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:01:55,161][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:01:55,667][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:01:56,173][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:01:56,682][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:01:57,191][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:01:57,698][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:01:58,205][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:01:58,710][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:01:59,215][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:01:59,726][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:02:00,230][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:02:00,745][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:02:01,248][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:02:01,751][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:02:02,257][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:02:02,763][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:02:03,272][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:02:03,772][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:02:04,276][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:02:04,786][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:02:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:02:05,797][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:02:06,298][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:02:06,801][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:02:07,308][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:02:07,816][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:02:08,320][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:02:08,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:02:09,345][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:02:09,850][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:02:10,358][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:02:10,866][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:02:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:02:11,897][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:02:12,404][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:02:12,912][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:02:13,421][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:02:13,930][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:02:14,439][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:02:14,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:02:15,455][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:02:15,960][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:02:16,474][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:02:16,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10793 tokens. [2025-11-13 02:02:17,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 02:02:18,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:02:18,426][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:02:18,428][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:02:19,379][__main__][INFO] - Iteration 265 took 52s (30.93% Gen, 67.26% Train). Generation: 16s, Training: 35s. Estimated remaining time: 39h 56m 4s. Estimated total time: 43h 48m 42s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 37s, 500 more iterations: 7h 18m 7s. [2025-11-13 02:02:19,381][__main__][INFO] - Starting iteration 265. [2025-11-13 02:02:19,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 02:02:19,879][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:02:36,712][__main__][INFO] - Number of regex retries in iteration 265: 0 [2025-11-13 02:02:36,713][__main__][INFO] - agents played in iteration 265 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:02:37,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:02:37,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:02:37,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:02:37,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:02:37,629][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:02:37,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:02:38,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:02:38,746][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:02:39,253][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:02:39,760][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:02:40,282][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:02:40,786][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:02:41,289][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:02:41,792][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:02:42,296][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:02:42,800][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:02:43,304][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:02:43,809][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:02:44,314][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:02:44,817][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:02:45,321][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:02:45,825][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:02:46,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:02:46,837][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:02:47,343][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:02:47,848][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:02:48,368][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:02:48,871][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:02:49,383][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:02:49,889][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:02:50,395][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:02:50,906][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:02:51,411][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:02:51,916][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:02:52,428][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:02:52,934][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:02:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:02:53,941][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:02:54,445][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:02:54,963][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:02:55,468][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:02:55,972][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:02:56,479][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:02:56,983][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:02:57,488][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:02:57,992][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:02:58,496][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:02:59,002][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:02:59,505][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:03:00,010][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:03:00,515][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:03:01,019][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:03:01,528][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:03:02,034][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:03:02,539][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:03:03,055][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:03:03,563][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:03:04,070][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:03:04,575][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:03:05,083][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:03:05,599][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:03:06,104][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:03:06,607][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:03:07,110][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:03:07,612][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:03:08,117][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:03:08,618][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:03:09,119][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:03:09,623][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:03:10,123][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:03:10,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10768 tokens. [2025-11-13 02:03:11,323][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 02:03:12,080][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:03:12,083][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:03:12,085][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:03:13,013][__main__][INFO] - Iteration 266 took 53s (31.68% Gen, 66.57% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 23m 15s. Estimated total time: 44h 16m 46s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 33s, 500 more iterations: 7h 22m 47s. [2025-11-13 02:03:13,015][__main__][INFO] - Starting iteration 266. [2025-11-13 02:03:13,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 02:03:13,507][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:03:30,964][__main__][INFO] - Number of regex retries in iteration 266: 0 [2025-11-13 02:03:30,965][__main__][INFO] - agents played in iteration 266 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:03:31,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:03:31,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:03:31,859][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:03:31,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:03:31,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:03:31,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:03:32,610][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:03:33,073][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:03:33,584][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:03:34,093][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:03:34,593][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:03:35,098][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:03:35,601][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:03:36,117][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:03:36,617][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:03:37,121][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:03:37,624][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:03:38,127][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:03:38,635][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:03:39,142][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:03:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:03:40,153][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:03:40,659][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:03:41,167][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:03:41,671][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:03:42,173][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:03:42,690][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:03:43,194][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:03:43,700][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:03:44,207][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:03:44,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:03:45,226][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:03:45,733][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:03:46,238][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:03:46,741][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:03:47,246][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:03:47,752][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:03:48,258][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:03:48,767][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:03:49,272][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:03:49,776][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:03:50,281][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:03:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:03:51,296][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:03:51,810][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:03:52,313][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:03:52,817][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:03:53,325][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:03:53,831][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:03:54,340][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:03:54,870][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:03:55,378][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:03:55,885][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:03:56,393][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:03:56,901][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:03:57,409][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:03:57,915][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:03:58,424][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:03:58,930][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:03:59,434][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:03:59,941][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:04:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:04:00,956][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:04:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:04:01,964][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:04:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:04:02,971][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:04:03,474][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:04:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:04:04,479][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:04:04,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10787 tokens. [2025-11-13 02:04:05,702][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.50%, ΔTime: 00:00:33 [2025-11-13 02:04:06,474][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:04:06,475][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:04:06,477][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:04:07,387][__main__][INFO] - Iteration 267 took 53s (32.40% Gen, 65.91% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 59m 38s. Estimated total time: 44h 54m 3s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 48s, 500 more iterations: 7h 29m 0s. [2025-11-13 02:04:07,390][__main__][INFO] - Starting iteration 267. [2025-11-13 02:04:07,868][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 02:04:07,869][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:04:15,332][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:04:23,317][__main__][INFO] - Number of regex retries in iteration 267: 1 [2025-11-13 02:04:23,317][__main__][INFO] - agents played in iteration 267 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:04:24,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:04:24,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:04:24,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:04:24,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:04:24,259][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:04:24,260][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:04:24,945][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:04:25,410][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:04:25,925][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:04:26,429][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:04:26,938][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:04:27,445][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:04:27,948][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:04:28,451][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:04:28,952][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:04:29,455][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:04:29,958][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:04:30,462][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:04:30,967][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:04:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:04:31,973][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:04:32,487][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:04:32,989][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:04:33,497][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:04:34,002][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:04:34,506][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:04:35,015][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:04:35,521][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:04:36,026][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:04:36,533][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:04:37,039][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:04:37,544][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:04:38,048][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:04:38,552][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:04:39,060][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:04:39,572][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:04:40,077][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:04:40,595][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:04:41,102][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:04:41,613][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:04:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:04:42,625][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:04:43,130][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:04:43,641][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:04:44,151][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:04:44,657][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:04:45,162][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:04:45,669][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:04:46,176][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:04:46,685][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:04:47,193][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:04:47,699][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:04:48,221][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:04:48,732][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:04:49,243][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:04:49,753][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:04:50,260][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:04:50,770][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:04:51,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:04:51,788][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:04:52,316][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:04:52,823][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:04:53,331][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:04:53,836][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:04:54,342][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:04:54,847][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:04:55,351][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:04:55,856][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:04:56,362][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:04:56,865][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:04:57,372][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10796 tokens. [2025-11-13 02:04:58,059][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.52%, Block Peak % of device VRAM: 62.45%, ΔTime: 00:00:33 [2025-11-13 02:04:58,829][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:04:58,831][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:04:58,833][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:04:59,833][__main__][INFO] - Iteration 268 took 51s (29.73% Gen, 68.35% Train). Generation: 15s, Training: 35s. Estimated remaining time: 39h 22m 56s. Estimated total time: 43h 18m 14s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 36s, 500 more iterations: 7h 13m 2s. [2025-11-13 02:04:59,835][__main__][INFO] - Starting iteration 268. [2025-11-13 02:05:00,331][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 02:05:00,331][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:05:12,613][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:05:13,426][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:05:17,615][__main__][INFO] - Number of regex retries in iteration 268: 2 [2025-11-13 02:05:17,616][__main__][INFO] - agents played in iteration 268 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:05:18,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:05:18,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:05:18,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:05:18,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:05:18,480][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:05:18,480][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:05:19,173][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:05:19,640][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:05:20,149][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:05:20,654][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:05:21,162][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:05:21,670][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:05:22,175][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:05:22,689][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:05:23,193][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:05:23,704][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:05:24,209][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:05:24,715][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:05:25,231][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:05:25,736][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:05:26,241][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:05:26,746][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:05:27,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:05:27,763][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:05:28,269][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:05:28,775][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:05:29,282][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:05:29,787][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:05:30,305][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:05:30,809][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:05:31,312][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:05:31,818][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:05:32,322][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:05:32,827][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:05:33,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:05:33,839][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:05:34,344][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:05:34,849][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:05:35,354][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:05:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:05:36,360][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:05:36,868][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:05:37,372][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:05:37,876][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:05:38,386][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:05:38,889][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:05:39,397][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:05:39,899][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:05:40,399][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:05:40,920][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:05:41,425][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:05:41,930][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:05:42,438][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:05:42,941][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:05:43,450][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:05:43,958][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:05:44,463][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:05:44,972][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:05:45,478][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:05:45,986][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:05:46,490][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:05:46,995][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:05:47,500][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:05:48,005][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:05:48,513][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:05:49,022][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:05:49,526][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:05:50,044][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:05:50,549][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:05:51,053][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:05:51,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10671 tokens. [2025-11-13 02:05:52,282][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.33%, Current % of VRAM taken: 58.57%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 02:05:53,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:05:53,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:05:53,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:05:54,016][__main__][INFO] - Iteration 269 took 53s (32.19% Gen, 66.04% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 48m 4s. Estimated total time: 44h 44m 17s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 28s, 500 more iterations: 7h 27m 22s. [2025-11-13 02:05:54,018][__main__][INFO] - Starting iteration 269. [2025-11-13 02:05:54,499][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 02:05:54,500][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:06:06,777][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 1 y books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:06:10,731][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 11 books, 9 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:06:12,973][__main__][INFO] - Number of regex retries in iteration 269: 2 [2025-11-13 02:06:12,974][__main__][INFO] - agents played in iteration 269 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:06:13,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:06:13,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:06:13,829][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:06:13,852][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:06:13,853][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:06:13,853][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:06:14,561][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:06:15,039][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:06:15,548][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:06:16,054][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:06:16,563][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:06:17,069][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:06:17,579][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:06:18,085][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:06:18,590][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:06:19,105][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:06:19,612][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:06:20,121][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:06:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:06:21,134][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:06:21,649][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:06:22,160][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:06:22,669][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:06:23,177][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:06:23,685][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:06:24,188][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:06:24,692][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:06:25,195][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:06:25,711][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:06:26,215][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:06:26,732][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:06:27,238][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:06:27,742][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:06:28,246][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:06:28,754][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:06:29,258][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:06:29,763][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:06:30,267][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:06:30,774][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:06:31,280][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:06:31,787][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:06:32,295][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:06:32,800][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:06:33,307][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:06:33,815][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:06:34,323][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:06:34,843][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:06:35,346][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:06:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:06:36,355][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:06:36,861][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:06:37,369][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:06:37,873][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:06:38,376][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:06:38,883][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:06:39,386][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:06:39,894][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:06:40,400][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:06:40,911][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:06:41,426][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:06:41,934][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:06:42,447][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:06:42,951][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:06:43,455][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:06:43,958][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:06:44,463][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:06:44,968][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:06:45,471][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:06:45,973][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:06:46,482][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:06:46,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10808 tokens. [2025-11-13 02:06:47,703][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 02:06:48,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:06:48,468][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:06:48,470][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:06:49,420][__main__][INFO] - Iteration 270 took 54s (33.64% Gen, 64.63% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 48m 58s. Estimated total time: 45h 46m 6s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 32s, 500 more iterations: 7h 37m 41s. [2025-11-13 02:06:49,422][__main__][INFO] - Starting iteration 270. [2025-11-13 02:06:49,902][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 26 and human policies 1. 
[2025-11-13 02:06:49,902][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:06:54,050][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:07:06,858][__main__][INFO] - Number of regex retries in iteration 270: 1 [2025-11-13 02:07:06,859][__main__][INFO] - agents played in iteration 270 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:07:07,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:07:07,770][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:07:07,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:07:07,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:07:07,816][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:07:07,817][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:07:08,530][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:07:08,995][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:07:09,505][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:07:10,027][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:07:10,534][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:07:11,041][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:07:11,549][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:07:12,056][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:07:12,563][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:07:13,068][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:07:13,573][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:07:14,083][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:07:14,587][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:07:15,090][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:07:15,594][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:07:16,098][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:07:16,604][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:07:17,106][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:07:17,609][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:07:18,113][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:07:18,616][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:07:19,118][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:07:19,622][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:07:20,123][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:07:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:07:21,135][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:07:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:07:22,149][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:07:22,654][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:07:23,165][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:07:23,671][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:07:24,177][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:07:24,684][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:07:25,189][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:07:25,693][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:07:26,201][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:07:26,702][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:07:27,205][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:07:27,708][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:07:28,208][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:07:28,709][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:07:29,211][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:07:29,715][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:07:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:07:30,722][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:07:31,226][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:07:31,731][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:07:32,253][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:07:32,757][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:07:33,261][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:07:33,774][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:07:34,279][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:07:34,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:07:35,293][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:07:35,795][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:07:36,302][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:07:36,802][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:07:37,304][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:07:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:07:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:07:38,810][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:07:39,311][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:07:39,812][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:07:40,317][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:07:40,818][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10771 tokens. [2025-11-13 02:07:41,528][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 02:07:42,273][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:07:42,275][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:07:42,277][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:07:44,098][__main__][INFO] - Iteration 271 took 54s (31.29% Gen, 65.35% Train). Generation: 16s, Training: 35s. Estimated remaining time: 41h 11m 48s. Estimated total time: 45h 9m 50s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 19s, 500 more iterations: 7h 31m 38s. [2025-11-13 02:07:44,100][__main__][INFO] - Starting iteration 271. [2025-11-13 02:07:44,567][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 02:07:44,568][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:07:49,105][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:08:01,676][__main__][INFO] - Number of regex retries in iteration 271: 1 [2025-11-13 02:08:01,676][__main__][INFO] - agents played in iteration 271 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:08:02,516][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:08:02,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:08:02,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:08:02,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:08:02,583][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:08:02,584][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:08:03,299][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:08:03,761][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:08:04,271][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:08:04,776][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:08:05,284][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:08:05,788][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:08:06,294][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:08:06,798][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:08:07,303][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:08:07,806][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:08:08,308][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:08:08,815][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:08:09,321][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:08:09,824][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:08:10,340][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:08:10,843][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:08:11,364][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:08:11,865][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:08:12,369][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:08:12,876][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:08:13,381][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:08:13,888][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:08:14,392][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:08:14,897][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:08:15,403][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:08:15,911][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:08:16,418][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:08:16,924][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:08:17,431][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:08:17,951][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:08:18,459][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:08:18,967][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:08:19,472][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:08:19,981][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:08:20,485][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:08:20,990][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:08:21,493][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:08:22,001][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:08:22,523][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:08:23,026][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:08:23,531][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:08:24,037][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:08:24,554][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:08:25,060][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:08:25,566][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:08:26,070][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:08:26,577][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:08:27,080][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:08:27,585][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:08:28,089][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:08:28,598][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:08:29,101][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:08:29,605][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:08:30,106][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:08:30,609][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:08:31,113][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:08:31,615][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:08:32,117][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:08:32,618][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:08:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:08:33,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:08:34,131][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:08:34,636][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:08:35,140][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:08:35,643][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10787 tokens. [2025-11-13 02:08:36,355][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 02:08:37,110][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:08:37,112][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:08:37,114][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:08:38,030][__main__][INFO] - Iteration 272 took 53s (32.00% Gen, 66.29% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 34m 11s. Estimated total time: 44h 33m 7s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 6s, 500 more iterations: 7h 25m 31s. [2025-11-13 02:08:38,032][__main__][INFO] - Starting iteration 272. [2025-11-13 02:08:38,520][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 02:08:38,522][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:08:56,157][__main__][INFO] - Number of regex retries in iteration 272: 0 [2025-11-13 02:08:56,157][__main__][INFO] - agents played in iteration 272 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:08:57,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:08:57,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:08:57,062][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:08:57,096][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:08:57,097][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:08:57,097][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:08:57,796][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:08:58,260][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:08:58,769][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:08:59,271][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:08:59,775][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:09:00,281][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:09:00,785][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:09:01,293][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:09:01,796][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:09:02,303][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:09:02,805][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:09:03,307][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:09:03,812][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:09:04,314][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:09:04,818][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:09:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:09:05,828][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:09:06,334][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:09:06,841][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:09:07,345][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:09:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:09:08,370][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:09:08,876][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:09:09,381][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:09:09,884][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:09:10,399][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:09:10,902][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:09:11,405][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:09:11,908][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:09:12,411][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:09:12,914][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:09:13,420][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:09:13,929][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:09:14,431][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:09:14,931][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:09:15,432][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:09:15,937][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:09:16,434][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:09:16,946][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:09:17,459][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:09:17,964][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:09:18,480][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:09:18,989][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:09:19,494][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:09:20,000][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:09:20,502][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:09:21,008][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:09:21,510][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:09:22,016][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:09:22,529][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:09:23,041][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:09:23,549][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:09:24,067][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:09:24,572][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:09:25,092][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:09:25,598][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:09:26,103][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:09:26,607][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:09:27,112][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:09:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:09:28,116][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:09:28,619][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:09:29,123][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:09:29,628][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:09:30,132][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10781 tokens. [2025-11-13 02:09:30,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 02:09:31,601][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:09:31,602][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:09:31,604][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:09:32,558][__main__][INFO] - Iteration 273 took 54s (32.63% Gen, 65.60% Train). Generation: 17s, Training: 35s. Estimated remaining time: 41h 2m 2s. Estimated total time: 45h 1m 53s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 3s, 500 more iterations: 7h 30m 18s. [2025-11-13 02:09:32,560][__main__][INFO] - Starting iteration 273. [2025-11-13 02:09:33,056][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 02:09:33,056][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:09:39,098][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:09:52,215][__main__][INFO] - Number of regex retries in iteration 273: 1 [2025-11-13 02:09:52,216][__main__][INFO] - agents played in iteration 273 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:09:53,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:09:53,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:09:53,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:09:53,138][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:09:53,139][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:09:53,139][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:09:53,807][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:09:54,269][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:09:54,779][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:09:55,299][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:09:55,800][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:09:56,304][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:09:56,809][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:09:57,313][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:09:57,822][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:09:58,325][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:09:58,829][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:09:59,334][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:09:59,840][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:10:00,347][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:10:00,853][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:10:01,360][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:10:01,866][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:10:02,370][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:10:02,873][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:10:03,375][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:10:03,880][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:10:04,391][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:10:04,893][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:10:05,398][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:10:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:10:06,400][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:10:06,902][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:10:07,405][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:10:07,907][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:10:08,408][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:10:08,906][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:10:09,403][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:10:09,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:10:10,407][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:10:10,912][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:10:11,421][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:10:11,930][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:10:12,440][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:10:12,949][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:10:13,465][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:10:13,969][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:10:14,474][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:10:14,983][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:10:15,488][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:10:15,997][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:10:16,503][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:10:17,009][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:10:17,516][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:10:18,022][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:10:18,527][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:10:19,028][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:10:19,531][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:10:20,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:10:20,542][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:10:21,043][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:10:21,546][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:10:22,049][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:10:22,560][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:10:23,064][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:10:23,567][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:10:24,073][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:10:24,582][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:10:25,088][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:10:25,592][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:10:26,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10731 tokens. [2025-11-13 02:10:26,810][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 02:10:27,575][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:10:27,577][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:10:27,579][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:10:28,508][__main__][INFO] - Iteration 274 took 55s (34.55% Gen, 63.77% Train). Generation: 19s, Training: 35s. Estimated remaining time: 42h 11m 50s. Estimated total time: 46h 12m 37s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 25s, 500 more iterations: 7h 42m 6s. [2025-11-13 02:10:28,510][__main__][INFO] - Starting iteration 274. [2025-11-13 02:10:28,983][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 02:10:28,984][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:10:42,561][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:10:45,967][__main__][INFO] - Number of regex retries in iteration 274: 1 [2025-11-13 02:10:45,968][__main__][INFO] - agents played in iteration 274 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:10:46,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:10:46,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:10:46,817][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:10:46,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:10:46,839][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:10:46,840][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:10:47,559][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:10:48,025][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:10:48,534][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:10:49,043][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:10:49,550][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:10:50,057][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:10:50,563][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:10:51,071][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:10:51,582][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:10:52,089][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:10:52,595][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:10:53,098][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:10:53,602][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:10:54,109][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:10:54,612][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:10:55,115][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:10:55,621][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:10:56,125][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:10:56,627][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:10:57,130][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:10:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:10:58,135][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:10:58,640][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:10:59,144][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:10:59,648][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:11:00,151][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:11:00,664][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:11:01,167][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:11:01,670][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:11:02,172][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:11:02,676][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:11:03,190][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:11:03,693][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:11:04,199][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:11:04,704][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:11:05,202][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:11:05,703][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:11:06,209][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:11:06,713][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:11:07,220][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:11:07,727][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:11:08,229][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:11:08,733][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:11:09,237][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:11:09,743][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:11:10,248][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:11:10,753][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:11:11,265][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:11:11,771][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:11:12,274][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:11:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:11:13,285][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:11:13,802][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:11:14,308][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:11:14,811][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:11:15,315][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:11:15,817][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:11:16,328][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:11:16,832][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:11:17,335][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:11:17,856][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:11:18,360][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:11:18,868][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:11:19,379][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:11:19,887][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10713 tokens. [2025-11-13 02:11:20,598][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:33 [2025-11-13 02:11:21,348][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:11:21,350][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:11:21,351][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:11:22,308][__main__][INFO] - Iteration 275 took 53s (31.85% Gen, 66.35% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 24m 35s. Estimated total time: 44h 26m 15s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 52s, 500 more iterations: 7h 24m 22s. [2025-11-13 02:11:22,310][__main__][INFO] - Starting iteration 275. [2025-11-13 02:11:22,788][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 02:11:22,789][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:11:29,525][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:11:34,519][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:11:41,508][__main__][INFO] - Number of regex retries in iteration 275: 2 [2025-11-13 02:11:41,509][__main__][INFO] - agents played in iteration 275 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:11:42,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:11:42,376][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:11:42,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:11:42,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:11:42,421][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:11:42,422][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:11:43,144][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:11:43,613][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:11:44,127][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:11:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:11:45,135][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:11:45,649][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:11:46,154][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:11:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:11:47,168][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:11:47,670][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:11:48,191][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:11:48,697][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:11:49,201][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:11:49,705][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:11:50,209][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:11:50,714][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:11:51,219][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:11:51,725][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:11:52,231][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:11:52,735][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:11:53,240][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:11:53,745][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:11:54,251][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:11:54,756][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:11:55,262][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:11:55,768][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:11:56,283][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:11:56,790][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:11:57,302][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:11:57,805][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:11:58,311][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:11:58,818][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:11:59,321][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:11:59,824][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:12:00,329][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:12:00,835][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:12:01,343][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:12:01,849][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:12:02,355][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:12:02,861][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:12:03,367][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:12:03,873][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:12:04,380][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:12:04,886][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:12:05,403][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:12:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:12:06,416][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:12:06,921][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:12:07,425][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:12:07,927][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:12:08,429][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:12:08,931][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:12:09,433][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:12:09,935][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:12:10,436][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:12:10,939][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:12:11,443][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:12:11,947][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:12:12,452][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:12:12,955][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:12:13,459][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:12:13,963][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:12:14,468][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:12:14,971][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:12:15,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10666 tokens. [2025-11-13 02:12:16,153][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 02:12:16,920][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:12:16,921][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:12:16,923][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:12:17,857][__main__][INFO] - Iteration 276 took 55s (33.99% Gen, 64.31% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 50m 51s. Estimated total time: 45h 53m 27s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 54s. [2025-11-13 02:12:17,859][__main__][INFO] - Starting iteration 276. [2025-11-13 02:12:18,339][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 02:12:18,340][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:12:23,461][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:12:36,775][__main__][INFO] - Number of regex retries in iteration 276: 1 [2025-11-13 02:12:36,776][__main__][INFO] - agents played in iteration 276 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:12:37,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:12:37,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:12:37,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:12:37,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:12:37,752][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:12:37,752][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:12:38,434][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:12:38,897][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:12:39,406][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:12:39,911][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:12:40,414][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:12:40,919][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:12:41,425][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:12:41,939][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:12:42,441][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:12:42,954][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:12:43,462][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:12:43,968][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:12:44,478][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:12:44,984][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:12:45,492][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:12:46,008][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:12:46,515][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:12:47,023][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:12:47,529][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:12:48,036][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:12:48,549][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:12:49,055][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:12:49,561][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:12:50,066][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:12:50,573][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:12:51,083][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:12:51,590][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:12:52,099][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:12:52,611][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:12:53,113][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:12:53,626][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:12:54,134][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:12:54,642][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:12:55,153][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:12:55,657][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:12:56,161][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:12:56,668][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:12:57,175][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:12:57,686][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:12:58,191][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:12:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:12:59,209][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:12:59,714][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:13:00,233][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:13:00,737][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:13:01,242][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:13:01,744][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:13:02,246][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:13:02,753][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:13:03,255][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:13:03,759][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:13:04,266][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:13:04,768][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:13:05,270][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:13:05,772][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:13:06,272][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:13:06,777][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:13:07,281][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:13:07,785][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:13:08,302][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:13:08,808][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:13:09,328][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:13:09,831][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:13:10,337][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:13:10,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10677 tokens. [2025-11-13 02:13:11,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.50%, ΔTime: 00:00:33 [2025-11-13 02:13:12,346][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:13:12,348][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:13:12,350][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:13:13,242][__main__][INFO] - Iteration 277 took 54s (33.58% Gen, 64.79% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 41m 36s. Estimated total time: 45h 45m 8s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 31s. [2025-11-13 02:13:13,245][__main__][INFO] - Starting iteration 277. [2025-11-13 02:13:13,717][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 02:13:13,718][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:13:30,828][__main__][INFO] - Number of regex retries in iteration 277: 0 [2025-11-13 02:13:30,828][__main__][INFO] - agents played in iteration 277 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:13:31,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:13:31,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:13:31,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:13:31,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:13:31,713][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:13:31,714][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:13:32,358][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:13:32,814][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:13:33,320][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:13:33,824][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:13:34,344][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:13:34,846][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:13:35,347][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:13:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:13:36,359][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:13:36,863][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:13:37,365][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:13:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:13:38,370][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:13:38,871][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:13:39,373][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:13:39,875][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:13:40,378][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:13:40,885][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:13:41,405][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:13:41,908][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:13:42,421][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:13:42,925][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:13:43,433][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:13:43,939][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:13:44,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:13:44,945][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:13:45,449][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:13:45,954][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:13:46,457][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:13:46,960][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:13:47,468][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:13:47,967][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:13:48,473][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:13:48,979][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:13:49,483][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:13:49,989][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:13:50,495][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:13:50,998][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:13:51,504][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:13:52,009][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:13:52,515][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:13:53,025][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:13:53,527][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:13:54,029][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:13:54,539][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:13:55,043][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:13:55,555][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:13:56,060][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:13:56,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:13:57,069][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:13:57,571][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:13:58,072][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:13:58,573][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:13:59,079][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:13:59,587][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:14:00,091][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:14:00,594][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:14:01,095][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:14:01,597][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:14:02,103][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:14:02,606][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:14:03,109][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:14:03,614][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:14:04,114][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:14:04,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10697 tokens. [2025-11-13 02:14:05,357][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 02:14:06,118][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:14:06,120][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:14:06,121][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:14:06,980][__main__][INFO] - Iteration 278 took 53s (32.12% Gen, 66.26% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 18m 46s. Estimated total time: 44h 23m 11s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 46s, 500 more iterations: 7h 23m 51s. [2025-11-13 02:14:06,982][__main__][INFO] - Starting iteration 278. [2025-11-13 02:14:07,468][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 02:14:07,468][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:14:12,084][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:14:25,231][__main__][INFO] - Number of regex retries in iteration 278: 1 [2025-11-13 02:14:25,232][__main__][INFO] - agents played in iteration 278 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:14:26,091][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:14:26,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:14:26,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:14:26,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:14:26,163][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:14:26,164][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:14:26,830][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:14:27,294][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:14:27,800][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:14:28,301][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:14:28,808][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:14:29,311][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:14:29,816][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:14:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:14:30,823][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:14:31,326][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:14:31,833][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:14:32,333][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:14:32,837][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:14:33,340][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:14:33,866][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:14:34,371][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:14:34,873][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:14:35,383][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:14:35,889][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:14:36,396][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:14:36,903][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:14:37,409][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:14:37,912][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:14:38,413][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:14:38,922][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:14:39,431][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:14:39,937][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:14:40,449][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:14:40,950][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:14:41,452][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:14:41,954][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:14:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:14:42,961][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:14:43,471][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:14:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:14:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:14:44,993][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:14:45,501][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:14:46,007][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:14:46,510][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:14:47,014][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:14:47,521][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:14:48,024][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:14:48,538][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:14:49,043][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:14:49,549][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:14:50,054][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:14:50,554][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:14:51,056][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:14:51,562][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:14:52,069][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:14:52,583][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:14:53,090][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:14:53,595][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:14:54,101][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:14:54,606][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:14:55,116][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:14:55,644][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:14:56,159][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:14:56,668][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:14:57,178][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:14:57,687][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:14:58,192][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:14:58,698][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:14:59,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10670 tokens. [2025-11-13 02:14:59,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.29%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 02:15:00,652][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:15:00,654][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:15:00,656][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:15:01,618][__main__][INFO] - Iteration 279 took 54s (32.80% Gen, 65.42% Train). Generation: 17s, Training: 35s. Estimated remaining time: 41h 2m 12s. Estimated total time: 45h 7m 32s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 15s, 500 more iterations: 7h 31m 15s. [2025-11-13 02:15:01,620][__main__][INFO] - Starting iteration 279. [2025-11-13 02:15:02,089][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 02:15:02,090][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:15:08,396][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:15:18,886][__main__][INFO] - Number of regex retries in iteration 279: 1 [2025-11-13 02:15:18,886][__main__][INFO] - agents played in iteration 279 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:15:19,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:15:19,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:15:19,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:15:19,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:15:19,814][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:15:19,815][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:15:20,488][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:15:20,948][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:15:21,461][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:15:21,962][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:15:22,469][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:15:22,975][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:15:23,478][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:15:23,988][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:15:24,491][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:15:24,997][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:15:25,502][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:15:26,006][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:15:26,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:15:27,012][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:15:27,515][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:15:28,022][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:15:28,526][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:15:29,029][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:15:29,533][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:15:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:15:30,556][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:15:31,058][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:15:31,559][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:15:32,068][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:15:32,571][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:15:33,084][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:15:33,584][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:15:34,089][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:15:34,592][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:15:35,096][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:15:35,602][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:15:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:15:36,607][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:15:37,108][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:15:37,611][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:15:38,117][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:15:38,623][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:15:39,126][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:15:39,632][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:15:40,131][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:15:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:15:41,141][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:15:41,646][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:15:42,152][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:15:42,658][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:15:43,164][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:15:43,683][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:15:44,189][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:15:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:15:45,196][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:15:45,702][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:15:46,208][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:15:46,713][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:15:47,218][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:15:47,724][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:15:48,229][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:15:48,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:15:49,243][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:15:49,748][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:15:50,255][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:15:50,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:15:51,266][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:15:51,774][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:15:52,281][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:15:52,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10600 tokens. [2025-11-13 02:15:53,506][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 02:15:54,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:15:54,274][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:15:54,276][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:15:55,177][__main__][INFO] - Iteration 280 took 53s (31.64% Gen, 66.66% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 8m 13s. Estimated total time: 44h 14m 27s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 28s, 500 more iterations: 7h 22m 24s. [2025-11-13 02:15:55,180][__main__][INFO] - Starting iteration 280. [2025-11-13 02:15:55,821][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 27 and human policies 1. 
[2025-11-13 02:15:55,821][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:16:00,583][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:16:03,253][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:16:13,928][__main__][INFO] - Number of regex retries in iteration 280: 2 [2025-11-13 02:16:13,928][__main__][INFO] - agents played in iteration 280 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:16:14,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:16:14,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:16:14,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:16:14,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:16:14,806][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:16:14,806][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:16:15,474][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:16:15,934][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:16:16,442][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:16:16,945][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:16:17,445][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:16:17,950][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:16:18,456][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:16:18,960][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:16:19,476][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:16:19,982][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:16:20,494][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:16:20,998][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:16:21,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:16:22,010][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:16:22,512][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:16:23,014][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:16:23,518][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:16:24,021][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:16:24,525][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:16:25,027][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:16:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:16:26,033][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:16:26,537][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:16:27,041][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:16:27,543][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:16:28,050][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:16:28,555][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:16:29,060][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:16:29,569][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:16:30,073][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:16:30,575][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:16:31,082][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:16:31,585][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:16:32,085][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:16:32,592][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:16:33,097][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:16:33,615][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:16:34,125][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:16:34,632][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:16:35,142][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:16:35,653][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:16:36,167][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:16:36,671][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:16:37,178][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:16:37,694][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:16:38,198][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:16:38,704][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:16:39,209][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:16:39,714][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:16:40,228][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:16:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:16:41,238][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:16:41,742][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:16:42,245][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:16:42,750][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:16:43,254][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:16:43,758][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:16:44,264][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:16:44,771][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:16:45,271][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:16:45,788][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:16:46,291][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:16:46,804][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:16:47,308][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:16:47,810][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10709 tokens. [2025-11-13 02:16:48,541][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 02:16:49,321][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:16:49,322][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:16:49,324][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:16:51,094][__main__][INFO] - Iteration 281 took 55s (32.76% Gen, 64.04% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 56m 32s. Estimated total time: 46h 3m 41s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 7s, 500 more iterations: 7h 40m 36s. [2025-11-13 02:16:51,096][__main__][INFO] - Starting iteration 281. [2025-11-13 02:16:51,584][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 02:16:51,584][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:17:07,166][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:17:09,619][__main__][INFO] - Number of regex retries in iteration 281: 1 [2025-11-13 02:17:09,620][__main__][INFO] - agents played in iteration 281 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:17:10,449][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:17:10,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:17:10,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:17:10,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:17:10,524][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:17:10,525][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:17:11,200][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:17:11,658][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:17:12,167][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:17:12,669][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:17:13,180][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:17:13,689][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:17:14,204][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:17:14,707][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:17:15,212][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:17:15,725][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:17:16,230][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:17:16,736][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:17:17,241][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:17:17,748][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:17:18,253][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:17:18,759][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:17:19,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:17:19,772][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:17:20,274][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:17:20,789][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:17:21,293][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:17:21,800][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:17:22,304][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:17:22,807][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:17:23,322][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:17:23,834][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:17:24,344][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:17:24,848][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:17:25,354][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:17:25,859][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:17:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:17:26,875][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:17:27,385][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:17:27,894][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:17:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:17:28,906][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:17:29,413][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:17:29,922][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:17:30,428][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:17:30,933][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:17:31,439][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:17:31,944][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:17:32,457][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:17:32,965][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:17:33,474][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:17:33,992][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:17:34,497][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:17:35,009][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:17:35,512][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:17:36,016][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:17:36,521][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:17:37,026][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:17:37,527][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:17:38,031][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:17:38,534][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:17:39,038][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:17:39,540][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:17:40,046][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:17:40,552][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:17:41,057][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:17:41,564][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:17:42,069][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:17:42,574][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:17:43,080][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:17:43,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10571 tokens. [2025-11-13 02:17:44,359][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.37%, ΔTime: 00:00:33 [2025-11-13 02:17:45,116][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:17:45,118][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:17:45,120][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:17:46,083][__main__][INFO] - Iteration 282 took 54s (33.09% Gen, 65.14% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 16m 54s. Estimated total time: 45h 24m 59s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 49s, 500 more iterations: 7h 34m 9s. [2025-11-13 02:17:46,085][__main__][INFO] - Starting iteration 282. [2025-11-13 02:17:46,556][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 02:17:46,556][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:17:52,138][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:18:03,843][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:18:05,947][__main__][INFO] - Number of regex retries in iteration 282: 2 [2025-11-13 02:18:05,948][__main__][INFO] - agents played in iteration 282 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:18:06,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:18:06,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:18:06,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:18:06,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:18:06,874][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:18:06,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:18:07,528][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:18:08,003][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:18:08,511][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:18:09,009][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:18:09,514][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:18:10,018][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:18:10,524][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:18:11,027][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:18:11,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:18:12,035][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:18:12,541][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:18:13,046][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:18:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:18:14,054][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:18:14,561][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:18:15,066][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:18:15,572][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:18:16,084][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:18:16,588][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:18:17,100][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:18:17,604][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:18:18,109][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:18:18,619][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:18:19,125][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:18:19,631][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:18:20,139][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:18:20,642][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:18:21,147][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:18:21,650][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:18:22,158][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:18:22,667][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:18:23,173][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:18:23,679][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:18:24,185][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:18:24,693][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:18:25,207][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:18:25,714][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:18:26,217][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:18:26,724][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:18:27,232][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:18:27,745][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:18:28,250][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:18:28,754][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:18:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:18:29,770][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:18:30,272][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:18:30,778][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:18:31,284][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:18:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:18:32,292][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:18:32,797][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:18:33,308][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:18:33,813][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:18:34,331][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:18:34,834][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:18:35,342][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:18:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:18:36,355][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:18:36,863][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:18:37,367][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:18:37,870][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:18:38,377][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:18:38,885][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:18:39,389][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:18:39,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10671 tokens. [2025-11-13 02:18:40,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 02:18:41,400][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:18:41,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:18:41,403][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:18:42,358][__main__][INFO] - Iteration 283 took 55s (34.75% Gen, 63.54% Train). Generation: 19s, Training: 35s. Estimated remaining time: 42h 21m 9s. Estimated total time: 46h 30m 9s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 0s, 500 more iterations: 7h 45m 1s. [2025-11-13 02:18:42,360][__main__][INFO] - Starting iteration 283. [2025-11-13 02:18:42,830][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 02:18:42,830][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:18:47,194][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:19:01,110][__main__][INFO] - Number of regex retries in iteration 283: 1 [2025-11-13 02:19:01,110][__main__][INFO] - agents played in iteration 283 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:19:01,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:19:01,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:19:01,949][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:19:01,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:19:01,973][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:19:01,973][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:19:02,630][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:19:03,109][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:19:03,614][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:19:04,116][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:19:04,623][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:19:05,125][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:19:05,633][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:19:06,138][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:19:06,645][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:19:07,155][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:19:07,664][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:19:08,170][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:19:08,675][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:19:09,179][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:19:09,687][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:19:10,195][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:19:10,701][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:19:11,213][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:19:11,718][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:19:12,232][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:19:12,735][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:19:13,239][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:19:13,743][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:19:14,249][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:19:14,752][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:19:15,260][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:19:15,768][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:19:16,275][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:19:16,782][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:19:17,289][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:19:17,805][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:19:18,311][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:19:18,829][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:19:19,335][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:19:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:19:20,351][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:19:20,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:19:21,362][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:19:21,869][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:19:22,373][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:19:22,883][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:19:23,385][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:19:23,889][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:19:24,398][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:19:24,904][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:19:25,407][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:19:25,910][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:19:26,412][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:19:26,926][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:19:27,429][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:19:27,932][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:19:28,436][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:19:28,941][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:19:29,451][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:19:29,956][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:19:30,463][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:19:30,969][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:19:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:19:31,972][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:19:32,472][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:19:32,975][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:19:33,479][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:19:33,980][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:19:34,483][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:19:34,983][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10440 tokens. [2025-11-13 02:19:35,885][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.39%, ΔTime: 00:00:33 [2025-11-13 02:19:36,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:19:36,648][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:19:36,650][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:19:37,562][__main__][INFO] - Iteration 284 took 54s (33.40% Gen, 64.93% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 26m 40s. Estimated total time: 45h 36m 36s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 13s, 500 more iterations: 7h 36m 6s. [2025-11-13 02:19:37,564][__main__][INFO] - Starting iteration 284. [2025-11-13 02:19:38,040][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 02:19:38,041][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:19:45,594][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:19:47,009][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:19:56,699][__main__][INFO] - Number of regex retries in iteration 284: 2 [2025-11-13 02:19:56,700][__main__][INFO] - agents played in iteration 284 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:19:57,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:19:57,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:19:57,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:19:57,547][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:19:57,548][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:19:57,549][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:19:58,237][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:19:58,700][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:19:59,213][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:19:59,718][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:20:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:20:00,728][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:20:01,232][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:20:01,735][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:20:02,258][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:20:02,760][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:20:03,265][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:20:03,768][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:20:04,271][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:20:04,776][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:20:05,281][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:20:05,787][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:20:06,292][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:20:06,799][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:20:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:20:07,807][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:20:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:20:08,816][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:20:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:20:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:20:10,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:20:10,835][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:20:11,347][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:20:11,854][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:20:12,361][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:20:12,869][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:20:13,377][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:20:13,884][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:20:14,390][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:20:14,896][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:20:15,402][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:20:15,909][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:20:16,414][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:20:16,933][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:20:17,439][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:20:17,943][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:20:18,450][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:20:18,955][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:20:19,468][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:20:19,970][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:20:20,475][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:20:20,978][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:20:21,483][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:20:21,988][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:20:22,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:20:22,994][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:20:23,499][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:20:24,003][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:20:24,506][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:20:25,009][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:20:25,512][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:20:26,015][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:20:26,519][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:20:27,023][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:20:27,533][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:20:28,037][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:20:28,549][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:20:29,052][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:20:29,555][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:20:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:20:30,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10705 tokens. [2025-11-13 02:20:31,289][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 02:20:32,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:20:32,069][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:20:32,070][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:20:33,021][__main__][INFO] - Iteration 285 took 54s (33.94% Gen, 64.33% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 38m 12s. Estimated total time: 45h 49m 3s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 10s. [2025-11-13 02:20:33,023][__main__][INFO] - Starting iteration 285. [2025-11-13 02:20:33,506][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 02:20:33,507][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:20:52,446][__main__][INFO] - Number of regex retries in iteration 285: 0 [2025-11-13 02:20:52,446][__main__][INFO] - agents played in iteration 285 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:20:53,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:20:53,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:20:53,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:20:53,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:20:53,392][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:20:53,392][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:20:54,106][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:20:54,571][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:20:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:20:55,585][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:20:56,095][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:20:56,596][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:20:57,105][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:20:57,620][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:20:58,125][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:20:58,633][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:20:59,140][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:20:59,648][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:21:00,163][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:21:00,667][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:21:01,171][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:21:01,674][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:21:02,179][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:21:02,685][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:21:03,196][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:21:03,698][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:21:04,203][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:21:04,707][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:21:05,213][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:21:05,720][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:21:06,228][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:21:06,748][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:21:07,253][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:21:07,759][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:21:08,267][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:21:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:21:09,282][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:21:09,792][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:21:10,297][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:21:10,806][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:21:11,312][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:21:11,817][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:21:12,320][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:21:12,824][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:21:13,332][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:21:13,833][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:21:14,335][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:21:14,837][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:21:15,341][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:21:15,861][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:21:16,365][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:21:16,869][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:21:17,371][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:21:17,874][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:21:18,378][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:21:18,877][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:21:19,378][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:21:19,883][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:21:20,381][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:21:20,884][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:21:21,392][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:21:21,892][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:21:22,393][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:21:22,895][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:21:23,397][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:21:23,899][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:21:24,403][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:21:24,908][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:21:25,406][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:21:25,906][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:21:26,411][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10632 tokens. [2025-11-13 02:21:27,162][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.35%, ΔTime: 00:00:33 [2025-11-13 02:21:27,959][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:21:27,960][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:21:27,962][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:21:28,901][__main__][INFO] - Iteration 286 took 55s (34.19% Gen, 64.11% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 58m 1s. Estimated total time: 46h 9m 48s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 38s. [2025-11-13 02:21:28,903][__main__][INFO] - Starting iteration 286. [2025-11-13 02:21:29,393][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 02:21:29,394][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:21:37,988][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:21:48,979][__main__][INFO] - Number of regex retries in iteration 286: 1 [2025-11-13 02:21:48,980][__main__][INFO] - agents played in iteration 286 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:21:49,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:21:49,845][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:21:49,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:21:49,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:21:49,891][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:21:49,891][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:21:50,599][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:21:51,074][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:21:51,582][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:21:52,090][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:21:52,592][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:21:53,098][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:21:53,600][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:21:54,105][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:21:54,607][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:21:55,117][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:21:55,625][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:21:56,130][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:21:56,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:21:57,143][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:21:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:21:58,164][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:21:58,668][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:21:59,169][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:21:59,673][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:22:00,180][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:22:00,683][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:22:01,191][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:22:01,693][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:22:02,195][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:22:02,699][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:22:03,207][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:22:03,714][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:22:04,225][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:22:04,731][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:22:05,237][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:22:05,744][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:22:06,250][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:22:06,753][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:22:07,253][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:22:07,752][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:22:08,257][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:22:08,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:22:09,274][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:22:09,776][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:22:10,278][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:22:10,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:22:11,291][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:22:11,793][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:22:12,294][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:22:12,793][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:22:13,299][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:22:13,805][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:22:14,307][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:22:14,811][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:22:15,312][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:22:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:22:16,317][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:22:16,822][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:22:17,327][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:22:17,832][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:22:18,334][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:22:18,848][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:22:19,358][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:22:19,876][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:22:20,381][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:22:20,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:22:21,417][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:22:21,925][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:22:22,428][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:22:22,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10539 tokens. [2025-11-13 02:22:23,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 02:22:24,380][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:22:24,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:22:24,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:22:25,296][__main__][INFO] - Iteration 287 took 55s (35.03% Gen, 63.33% Train). Generation: 19s, Training: 35s. Estimated remaining time: 42h 22m 27s. Estimated total time: 46h 35m 10s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 10s, 500 more iterations: 7h 45m 51s. [2025-11-13 02:22:25,299][__main__][INFO] - Starting iteration 287. [2025-11-13 02:22:25,764][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 02:22:25,765][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:22:44,377][__main__][INFO] - Number of regex retries in iteration 287: 0 [2025-11-13 02:22:44,378][__main__][INFO] - agents played in iteration 287 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:22:45,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:22:45,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:22:45,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:22:45,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:22:45,298][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:22:45,299][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:22:45,982][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:22:46,443][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:22:46,947][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:22:47,460][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:22:47,964][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:22:48,466][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:22:48,971][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:22:49,477][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:22:49,985][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:22:50,493][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:22:51,003][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:22:51,514][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:22:52,040][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:22:52,547][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:22:53,049][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:22:53,549][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:22:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:22:54,563][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:22:55,073][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:22:55,576][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:22:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:22:56,598][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:22:57,101][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:22:57,609][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:22:58,112][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:22:58,614][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:22:59,121][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:22:59,628][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:23:00,132][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:23:00,636][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:23:01,138][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:23:01,638][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:23:02,142][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:23:02,644][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:23:03,151][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:23:03,656][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:23:04,156][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:23:04,673][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:23:05,177][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:23:05,693][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:23:06,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:23:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:23:07,210][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:23:07,710][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:23:08,211][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:23:08,712][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:23:09,215][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:23:09,717][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:23:10,219][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:23:10,723][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:23:11,228][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:23:11,732][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:23:12,235][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:23:12,740][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:23:13,247][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:23:13,752][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:23:14,253][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:23:14,754][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:23:15,256][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:23:15,761][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:23:16,278][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:23:16,785][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:23:17,287][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:23:17,791][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:23:18,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10453 tokens. [2025-11-13 02:23:18,976][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 02:23:19,734][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:23:19,735][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:23:19,737][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:23:20,665][__main__][INFO] - Iteration 288 took 54s (33.90% Gen, 64.40% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 31m 23s. Estimated total time: 45h 45m 2s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 30s, 500 more iterations: 7h 37m 30s. [2025-11-13 02:23:20,667][__main__][INFO] - Starting iteration 288. [2025-11-13 02:23:21,138][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 02:23:21,138][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:23:25,794][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:23:33,123][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 1 y book, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:23:38,963][__main__][INFO] - Number of regex retries in iteration 288: 2 [2025-11-13 02:23:38,963][__main__][INFO] - agents played in iteration 288 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:23:39,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:23:39,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:23:39,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:23:39,833][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:23:39,833][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:23:39,834][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:23:40,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:23:40,976][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:23:41,504][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:23:42,011][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:23:42,518][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:23:43,023][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:23:43,527][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:23:44,034][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:23:44,539][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:23:45,046][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:23:45,552][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:23:46,059][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:23:46,564][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:23:47,072][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:23:47,574][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:23:48,106][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:23:48,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:23:49,117][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:23:49,619][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:23:50,129][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:23:50,639][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:23:51,140][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:23:51,643][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:23:52,147][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:23:52,651][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:23:53,160][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:23:53,666][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:23:54,171][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:23:54,692][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:23:55,196][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:23:55,709][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:23:56,213][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:23:56,719][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:23:57,225][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:23:57,758][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:23:58,262][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:23:58,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:23:59,272][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:23:59,776][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:24:00,278][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:24:00,784][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:24:01,301][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:24:01,805][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:24:02,315][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:24:02,818][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:24:03,321][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:24:03,825][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:24:04,329][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:24:04,833][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:24:05,335][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:24:05,834][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:24:06,337][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:24:06,840][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:24:07,341][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:24:07,841][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:24:08,343][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:24:08,844][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:24:09,347][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:24:09,851][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:24:10,356][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:24:10,858][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:24:11,362][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:24:11,865][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:24:12,369][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:24:12,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10586 tokens. [2025-11-13 02:24:13,593][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 02:24:14,370][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:24:14,372][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:24:14,374][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:24:15,288][__main__][INFO] - Iteration 289 took 54s (32.92% Gen, 65.39% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 52m 57s. Estimated total time: 45h 7m 31s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 15s, 500 more iterations: 7h 31m 15s. [2025-11-13 02:24:15,290][__main__][INFO] - Starting iteration 289. [2025-11-13 02:24:15,779][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 02:24:15,779][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:24:32,724][__main__][INFO] - Number of regex retries in iteration 289: 0 [2025-11-13 02:24:32,725][__main__][INFO] - agents played in iteration 289 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:24:33,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:24:33,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:24:33,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:24:33,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:24:33,613][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:24:33,614][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:24:34,291][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:24:34,754][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:24:35,262][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:24:35,768][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:24:36,280][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:24:36,786][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:24:37,293][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:24:37,796][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:24:38,302][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:24:38,827][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:24:39,333][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:24:39,840][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:24:40,347][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:24:40,852][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:24:41,365][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:24:41,871][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:24:42,374][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:24:42,905][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:24:43,408][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:24:43,927][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:24:44,432][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:24:44,937][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:24:45,444][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:24:45,949][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:24:46,453][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:24:46,957][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:24:47,464][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:24:47,971][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:24:48,475][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:24:48,980][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:24:49,486][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:24:49,991][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:24:50,509][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:24:51,015][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:24:51,520][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:24:52,029][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:24:52,532][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:24:53,037][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:24:53,544][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:24:54,047][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:24:54,552][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:24:55,054][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:24:55,556][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:24:56,059][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:24:56,563][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:24:57,068][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:24:57,570][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:24:58,073][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:24:58,573][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:24:59,079][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:24:59,583][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:25:00,086][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:25:00,589][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:25:01,104][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:25:01,606][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:25:02,121][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:25:02,623][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:25:03,123][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:25:03,628][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:25:04,131][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:25:04,634][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:25:05,133][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:25:05,637][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:25:06,151][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:25:06,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10713 tokens. [2025-11-13 02:25:07,357][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 02:25:08,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:25:08,121][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:25:08,123][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:25:09,083][__main__][INFO] - Iteration 290 took 53s (31.79% Gen, 66.41% Train). Generation: 16s, Training: 35s. Estimated remaining time: 40h 9m 45s. Estimated total time: 44h 25m 13s. Time estimates for 10 more iterations: 8m 53s, 100 more iterations: 1h 28m 50s, 500 more iterations: 7h 24m 12s. [2025-11-13 02:25:09,085][__main__][INFO] - Starting iteration 290. [2025-11-13 02:25:09,555][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 28 and human policies 1. 
[2025-11-13 02:25:09,555][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:25:17,689][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:25:28,056][__main__][INFO] - Number of regex retries in iteration 290: 1 [2025-11-13 02:25:28,057][__main__][INFO] - agents played in iteration 290 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:25:28,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:25:28,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:25:28,896][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:25:28,919][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:25:28,919][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:25:28,920][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:25:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:25:30,092][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:25:30,603][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:25:31,110][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:25:31,615][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:25:32,116][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:25:32,618][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:25:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:25:33,644][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:25:34,154][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:25:34,662][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:25:35,167][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:25:35,671][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:25:36,177][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:25:36,681][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:25:37,183][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:25:37,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:25:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:25:38,689][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:25:39,189][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:25:39,689][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:25:40,192][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:25:40,691][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:25:41,190][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:25:41,695][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:25:42,200][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:25:42,710][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:25:43,216][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:25:43,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:25:44,225][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:25:44,727][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:25:45,234][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:25:45,740][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:25:46,241][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:25:46,751][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:25:47,255][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:25:47,758][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:25:48,268][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:25:48,769][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:25:49,273][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:25:49,776][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:25:50,279][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:25:50,782][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:25:51,286][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:25:51,788][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:25:52,292][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:25:52,795][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:25:53,297][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:25:53,811][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:25:54,314][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:25:54,839][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:25:55,343][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:25:55,850][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:25:56,352][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:25:56,853][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:25:57,358][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:25:57,863][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:25:58,372][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:25:58,900][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:25:59,404][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:25:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:26:00,423][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:26:00,931][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:26:01,442][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:26:01,947][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10536 tokens. [2025-11-13 02:26:02,677][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 02:26:03,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:26:03,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:26:03,451][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:26:05,287][__main__][INFO] - Iteration 291 took 55s (33.20% Gen, 63.51% Train). Generation: 18s, Training: 35s. Estimated remaining time: 42h 10m 16s. Estimated total time: 46h 26m 39s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 53s, 500 more iterations: 7h 44m 26s. [2025-11-13 02:26:05,289][__main__][INFO] - Starting iteration 291. [2025-11-13 02:26:05,761][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 02:26:05,762][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:26:12,152][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:26:15,851][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:26:24,508][__main__][INFO] - Number of regex retries in iteration 291: 2 [2025-11-13 02:26:24,508][__main__][INFO] - agents played in iteration 291 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:26:25,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:26:25,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:26:25,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:26:25,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:26:25,430][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:26:25,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:26:26,083][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:26:26,558][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:26:27,068][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:26:27,577][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:26:28,083][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:26:28,587][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:26:29,090][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:26:29,592][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:26:30,095][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:26:30,601][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:26:31,104][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:26:31,607][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:26:32,105][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:26:32,605][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:26:33,111][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:26:33,612][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:26:34,114][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:26:34,633][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:26:35,134][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:26:35,646][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:26:36,152][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:26:36,656][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:26:37,170][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:26:37,674][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:26:38,179][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:26:38,686][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:26:39,191][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:26:39,697][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:26:40,202][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:26:40,703][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:26:41,212][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:26:41,719][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:26:42,220][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:26:42,730][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:26:43,244][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:26:43,758][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:26:44,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:26:44,769][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:26:45,274][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:26:45,779][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:26:46,281][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:26:46,784][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:26:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:26:47,793][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:26:48,297][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:26:48,800][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:26:49,305][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:26:49,808][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:26:50,311][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:26:50,816][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:26:51,320][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:26:51,821][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:26:52,322][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:26:52,823][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:26:53,326][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:26:53,829][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:26:54,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:26:54,848][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:26:55,350][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:26:55,856][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:26:56,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:26:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:26:57,371][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:26:57,874][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:26:58,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10622 tokens. [2025-11-13 02:26:59,165][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 02:26:59,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:26:59,918][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:26:59,923][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:27:00,826][__main__][INFO] - Iteration 292 took 55s (34.04% Gen, 64.31% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 35m 56s. Estimated total time: 45h 53m 15s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 46s, 500 more iterations: 7h 38m 52s. [2025-11-13 02:27:00,828][__main__][INFO] - Starting iteration 292. [2025-11-13 02:27:01,303][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 02:27:01,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:27:18,073][__main__][INFO] - Number of regex retries in iteration 292: 0 [2025-11-13 02:27:18,074][__main__][INFO] - agents played in iteration 292 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:27:18,865][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:27:18,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:27:18,916][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:27:18,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:27:18,939][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:27:18,940][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:27:19,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:27:20,075][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:27:20,590][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:27:21,096][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:27:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:27:22,110][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:27:22,611][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:27:23,116][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:27:23,618][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:27:24,122][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:27:24,640][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:27:25,143][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:27:25,656][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:27:26,159][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:27:26,662][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:27:27,166][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:27:27,668][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:27:28,169][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:27:28,669][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:27:29,173][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:27:29,680][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:27:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:27:30,684][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:27:31,192][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:27:31,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:27:32,201][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:27:32,706][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:27:33,215][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:27:33,721][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:27:34,225][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:27:34,730][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:27:35,231][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:27:35,737][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:27:36,258][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:27:36,763][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:27:37,272][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:27:37,778][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:27:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:27:38,787][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:27:39,289][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:27:39,791][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:27:40,295][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:27:40,797][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:27:41,298][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:27:41,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:27:42,307][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:27:42,807][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:27:43,308][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:27:43,812][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:27:44,320][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:27:44,826][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:27:45,335][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:27:45,838][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:27:46,342][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:27:46,847][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:27:47,353][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:27:47,859][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:27:48,361][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:27:48,868][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:27:49,383][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:27:49,889][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:27:50,396][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:27:50,901][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:27:51,408][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:27:51,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10558 tokens. [2025-11-13 02:27:52,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 02:27:53,428][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:27:53,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:27:53,431][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:27:54,319][__main__][INFO] - Iteration 293 took 53s (31.63% Gen, 66.69% Train). Generation: 16s, Training: 35s. Estimated remaining time: 39h 52m 36s. Estimated total time: 44h 10m 49s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 21s, 500 more iterations: 7h 21m 48s. [2025-11-13 02:27:54,321][__main__][INFO] - Starting iteration 293. [2025-11-13 02:27:54,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 02:27:54,810][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:28:07,829][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:28:12,913][__main__][INFO] - Number of regex retries in iteration 293: 1 [2025-11-13 02:28:12,914][__main__][INFO] - agents played in iteration 293 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:28:13,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:28:13,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:28:13,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:28:13,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:28:13,776][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:28:13,776][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:28:14,490][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:28:14,950][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:28:15,474][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:28:15,977][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:28:16,477][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:28:16,975][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:28:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:28:17,978][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:28:18,482][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:28:18,987][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:28:19,493][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:28:19,994][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:28:20,496][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:28:20,997][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:28:21,498][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:28:22,004][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:28:22,507][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:28:23,009][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:28:23,515][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:28:24,017][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:28:24,522][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:28:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:28:25,527][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:28:26,039][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:28:26,541][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:28:27,045][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:28:27,552][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:28:28,055][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:28:28,574][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:28:29,080][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:28:29,585][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:28:30,092][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:28:30,597][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:28:31,103][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:28:31,609][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:28:32,114][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:28:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:28:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:28:33,626][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:28:34,138][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:28:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:28:35,150][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:28:35,652][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:28:36,153][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:28:36,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:28:37,165][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:28:37,675][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:28:38,177][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:28:38,682][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:28:39,194][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:28:39,694][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:28:40,196][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:28:40,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:28:41,203][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:28:41,709][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:28:42,210][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:28:42,715][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:28:43,226][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:28:43,729][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:28:44,236][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:28:44,744][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:28:45,250][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:28:45,759][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:28:46,264][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:28:46,770][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10575 tokens. [2025-11-13 02:28:47,552][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 02:28:48,342][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:28:48,344][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:28:48,346][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:28:49,257][__main__][INFO] - Iteration 294 took 54s (33.25% Gen, 65.08% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 3m 17s. Estimated total time: 45h 22m 24s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 44s, 500 more iterations: 7h 33m 44s. [2025-11-13 02:28:49,259][__main__][INFO] - Starting iteration 294. [2025-11-13 02:28:49,753][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 02:28:49,754][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:28:56,418][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:29:02,926][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:29:09,704][__main__][INFO] - Number of regex retries in iteration 294: 2 [2025-11-13 02:29:09,704][__main__][INFO] - agents played in iteration 294 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:29:10,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:29:10,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:29:10,571][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:29:10,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:29:10,594][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:29:10,595][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:29:11,265][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:29:11,727][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:29:12,239][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:29:12,743][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:29:13,249][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:29:13,753][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:29:14,256][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:29:14,764][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:29:15,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:29:15,767][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:29:16,274][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:29:16,777][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:29:17,291][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:29:17,789][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:29:18,290][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:29:18,791][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:29:19,295][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:29:19,803][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:29:20,308][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:29:20,814][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:29:21,324][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:29:21,829][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:29:22,336][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:29:22,846][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:29:23,351][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:29:23,858][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:29:24,359][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:29:24,864][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:29:25,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:29:25,874][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:29:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:29:26,878][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:29:27,382][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:29:27,897][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:29:28,400][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:29:28,905][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:29:29,405][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:29:29,909][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:29:30,415][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:29:30,917][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:29:31,420][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:29:31,924][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:29:32,427][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:29:32,928][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:29:33,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:29:33,936][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:29:34,445][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:29:34,948][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:29:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:29:35,956][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:29:36,464][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:29:36,986][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:29:37,489][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:29:37,995][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:29:38,499][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:29:39,005][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:29:39,510][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:29:40,015][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:29:40,522][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:29:41,025][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:29:41,531][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:29:42,038][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:29:42,545][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:29:43,051][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:29:43,578][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10451 tokens. [2025-11-13 02:29:44,324][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.39%, ΔTime: 00:00:33 [2025-11-13 02:29:45,112][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:29:45,114][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:29:45,115][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:29:46,009][__main__][INFO] - Iteration 295 took 56s (35.46% Gen, 62.95% Train). Generation: 19s, Training: 35s. Estimated remaining time: 42h 32m 44s. Estimated total time: 46h 52m 49s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 45s, 500 more iterations: 7h 48m 48s. [2025-11-13 02:29:46,011][__main__][INFO] - Starting iteration 295. [2025-11-13 02:29:46,492][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 02:29:46,492][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:29:53,740][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:30:06,907][__main__][INFO] - Number of regex retries in iteration 295: 1 [2025-11-13 02:30:06,908][__main__][INFO] - agents played in iteration 295 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:30:07,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:30:07,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:30:07,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:30:07,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:30:07,811][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:30:07,812][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:30:08,486][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:30:08,946][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:30:09,455][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:30:09,971][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:30:10,476][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:30:10,976][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:30:11,481][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:30:11,984][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:30:12,487][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:30:12,989][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:30:13,492][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:30:13,999][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:30:14,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:30:15,014][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:30:15,518][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:30:16,028][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:30:16,535][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:30:17,043][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:30:17,549][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:30:18,053][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:30:18,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:30:19,072][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:30:19,576][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:30:20,081][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:30:20,586][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:30:21,087][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:30:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:30:22,102][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:30:22,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:30:23,106][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:30:23,606][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:30:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:30:24,612][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:30:25,116][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:30:25,615][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:30:26,113][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:30:26,612][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:30:27,113][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:30:27,611][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:30:28,119][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:30:28,619][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:30:29,126][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:30:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:30:30,135][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:30:30,638][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:30:31,144][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:30:31,649][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:30:32,169][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:30:32,673][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:30:33,185][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:30:33,692][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:30:34,197][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:30:34,709][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:30:35,211][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:30:35,714][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:30:36,220][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:30:36,723][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:30:37,232][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:30:37,739][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:30:38,247][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:30:38,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:30:39,264][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:30:39,778][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:30:40,282][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:30:40,787][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10409 tokens. [2025-11-13 02:30:41,547][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 02:30:42,309][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:30:42,311][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:30:42,313][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:30:43,156][__main__][INFO] - Iteration 296 took 56s (36.03% Gen, 62.48% Train). Generation: 20s, Training: 35s. Estimated remaining time: 42h 52m 13s. Estimated total time: 47h 13m 14s. Time estimates for 10 more iterations: 9m 26s, 100 more iterations: 1h 34m 26s, 500 more iterations: 7h 52m 12s. [2025-11-13 02:30:43,158][__main__][INFO] - Starting iteration 296. [2025-11-13 02:30:43,666][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 02:30:43,667][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:31:03,758][__main__][INFO] - Number of regex retries in iteration 296: 0 [2025-11-13 02:31:03,759][__main__][INFO] - agents played in iteration 296 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:31:04,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:31:04,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:31:04,636][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:31:04,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:31:04,660][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:31:04,660][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:31:05,315][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:31:05,776][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:31:06,285][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:31:06,787][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:31:07,305][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:31:07,807][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:31:08,309][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:31:08,813][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:31:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:31:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:31:10,324][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:31:10,826][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:31:11,330][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:31:11,836][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:31:12,339][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:31:12,843][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:31:13,349][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:31:13,858][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:31:14,364][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:31:14,868][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:31:15,375][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:31:15,875][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:31:16,387][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:31:16,887][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:31:17,388][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:31:17,896][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:31:18,405][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:31:18,913][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:31:19,419][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:31:19,925][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:31:20,433][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:31:20,942][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:31:21,449][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:31:21,956][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:31:22,459][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:31:22,981][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:31:23,486][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:31:23,991][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:31:24,496][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:31:25,003][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:31:25,511][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:31:26,017][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:31:26,524][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:31:27,035][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:31:27,539][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:31:28,045][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:31:28,551][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:31:29,057][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:31:29,564][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:31:30,069][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:31:30,574][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:31:31,079][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:31:31,584][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:31:32,105][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:31:32,608][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:31:33,113][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:31:33,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:31:34,122][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:31:34,630][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:31:35,140][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:31:35,647][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:31:36,157][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:31:36,664][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:31:37,179][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:31:37,685][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10515 tokens. [2025-11-13 02:31:38,430][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 02:31:39,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:31:39,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:31:39,203][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:31:40,093][__main__][INFO] - Iteration 297 took 56s (35.61% Gen, 62.82% Train). Generation: 20s, Training: 35s. Estimated remaining time: 42h 39m 23s. Estimated total time: 47h 1m 21s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 2s, 500 more iterations: 7h 50m 13s. [2025-11-13 02:31:40,096][__main__][INFO] - Starting iteration 297. [2025-11-13 02:31:40,587][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 02:31:40,588][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:31:50,135][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:32:00,519][__main__][INFO] - Number of regex retries in iteration 297: 1 [2025-11-13 02:32:00,520][__main__][INFO] - agents played in iteration 297 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:32:01,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:32:01,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:32:01,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:32:01,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:32:01,372][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:32:01,372][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:32:02,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:32:02,533][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:32:03,043][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:32:03,547][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:32:04,051][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:32:04,551][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:32:05,054][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:32:05,557][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:32:06,061][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:32:06,587][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:32:07,092][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:32:07,605][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:32:08,108][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:32:08,609][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:32:09,117][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:32:09,626][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:32:10,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:32:10,640][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:32:11,144][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:32:11,649][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:32:12,152][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:32:12,656][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:32:13,170][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:32:13,677][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:32:14,191][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:32:14,719][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:32:15,226][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:32:15,733][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:32:16,238][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:32:16,742][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:32:17,249][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:32:17,754][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:32:18,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:32:18,768][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:32:19,275][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:32:19,777][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:32:20,287][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:32:20,812][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:32:21,316][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:32:21,826][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:32:22,335][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:32:22,845][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:32:23,348][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:32:23,851][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:32:24,355][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:32:24,863][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:32:25,369][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:32:25,874][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:32:26,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:32:26,901][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:32:27,423][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:32:27,931][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:32:28,442][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:32:28,951][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:32:29,456][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:32:29,960][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:32:30,470][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:32:30,974][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:32:31,495][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:32:32,000][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:32:32,500][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:32:33,008][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:32:33,510][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:32:34,021][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:32:34,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10449 tokens. [2025-11-13 02:32:35,238][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.43%, ΔTime: 00:00:33 [2025-11-13 02:32:36,011][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:32:36,013][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:32:36,015][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:32:36,979][__main__][INFO] - Iteration 298 took 56s (35.34% Gen, 62.94% Train). Generation: 19s, Training: 35s. Estimated remaining time: 42h 36m 43s. Estimated total time: 46h 59m 38s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 59s, 500 more iterations: 7h 49m 56s. [2025-11-13 02:32:36,982][__main__][INFO] - Starting iteration 298. [2025-11-13 02:32:37,471][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 02:32:37,471][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:32:42,881][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:32:57,012][__main__][INFO] - Number of regex retries in iteration 298: 1 [2025-11-13 02:32:57,013][__main__][INFO] - agents played in iteration 298 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:32:57,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:32:57,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:32:57,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:32:57,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:32:57,943][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:32:57,945][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:32:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:32:59,137][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:32:59,646][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:33:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:33:00,668][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:33:01,176][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:33:01,683][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:33:02,196][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:33:02,702][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:33:03,209][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:33:03,717][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:33:04,224][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:33:04,736][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:33:05,260][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:33:05,768][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:33:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:33:06,779][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:33:07,306][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:33:07,810][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:33:08,313][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:33:08,821][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:33:09,326][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:33:09,832][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:33:10,334][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:33:10,839][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:33:11,345][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:33:11,849][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:33:12,351][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:33:12,856][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:33:13,359][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:33:13,867][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:33:14,372][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:33:14,884][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:33:15,391][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:33:15,895][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:33:16,419][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:33:16,921][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:33:17,427][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:33:17,932][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:33:18,437][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:33:18,944][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:33:19,453][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:33:19,954][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:33:20,460][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:33:20,966][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:33:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:33:21,981][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:33:22,490][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:33:23,003][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:33:23,510][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:33:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:33:24,523][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:33:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:33:25,532][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:33:26,037][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:33:26,541][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:33:27,042][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:33:27,544][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:33:28,052][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:33:28,558][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:33:29,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:33:29,564][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:33:30,064][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:33:30,564][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:33:31,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10533 tokens. [2025-11-13 02:33:31,800][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 02:33:32,570][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:33:32,572][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:33:32,574][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:33:33,533][__main__][INFO] - Iteration 299 took 56s (34.86% Gen, 63.43% Train). Generation: 19s, Training: 35s. Estimated remaining time: 42h 19m 15s. Estimated total time: 46h 43m 7s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 26s, 500 more iterations: 7h 47m 11s. [2025-11-13 02:33:33,535][__main__][INFO] - Starting iteration 299. [2025-11-13 02:33:34,011][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 02:33:34,011][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:33:39,925][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:33:52,031][__main__][INFO] - Number of regex retries in iteration 299: 1 [2025-11-13 02:33:52,031][__main__][INFO] - agents played in iteration 299 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:33:52,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:33:52,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:33:52,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:33:52,941][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:33:52,942][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:33:52,943][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:33:53,681][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:33:54,149][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:33:54,661][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:33:55,168][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:33:55,677][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:33:56,187][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:33:56,707][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:33:57,213][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:33:57,719][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:33:58,226][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:33:58,733][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:33:59,249][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:33:59,756][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:34:00,263][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:34:00,770][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:34:01,275][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:34:01,782][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:34:02,285][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:34:02,788][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:34:03,315][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:34:03,817][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:34:04,317][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:34:04,825][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:34:05,333][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:34:05,840][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:34:06,346][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:34:06,852][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:34:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:34:07,868][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:34:08,373][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:34:08,879][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:34:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:34:09,893][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:34:10,403][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:34:10,909][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:34:11,414][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:34:11,925][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:34:12,434][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:34:12,940][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:34:13,448][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:34:13,959][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:34:14,461][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:34:14,967][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:34:15,468][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:34:15,971][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:34:16,479][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:34:16,987][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:34:17,490][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:34:18,000][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:34:18,504][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:34:19,024][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:34:19,529][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:34:20,036][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:34:20,538][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:34:21,044][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:34:21,553][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:34:22,057][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:34:22,557][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:34:23,059][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:34:23,563][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:34:24,065][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:34:24,567][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:34:25,067][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:34:25,570][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:34:26,070][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10521 tokens. [2025-11-13 02:34:26,782][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 02:34:27,550][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:34:27,552][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:34:27,554][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:34:28,540][__main__][INFO] - Iteration 300 took 54s (33.05% Gen, 65.14% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 1m 42s. Estimated total time: 45h 26m 29s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 52s, 500 more iterations: 7h 34m 24s. [2025-11-13 02:34:28,542][__main__][INFO] - Starting iteration 300. [2025-11-13 02:34:29,030][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 29 and human policies 1. 
[2025-11-13 02:34:29,031][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:34:40,823][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:34:47,109][__main__][INFO] - Number of regex retries in iteration 300: 1 [2025-11-13 02:34:47,110][__main__][INFO] - agents played in iteration 300 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:34:47,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:34:47,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:34:47,963][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:34:47,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:34:47,986][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:34:47,987][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:34:48,726][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:34:49,188][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:34:49,696][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:34:50,205][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:34:50,710][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:34:51,213][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:34:51,719][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:34:52,219][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:34:52,720][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:34:53,230][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:34:53,733][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:34:54,237][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:34:54,742][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:34:55,244][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:34:55,750][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:34:56,254][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:34:56,758][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:34:57,263][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:34:57,766][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:34:58,271][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:34:58,774][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:34:59,281][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:34:59,781][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:35:00,282][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:35:00,786][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:35:01,301][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:35:01,808][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:35:02,328][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:35:02,832][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:35:03,338][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:35:03,840][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:35:04,346][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:35:04,851][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:35:05,353][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:35:05,858][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:35:06,363][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:35:06,873][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:35:07,377][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:35:07,880][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:35:08,379][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:35:08,884][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:35:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:35:09,889][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:35:10,391][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:35:10,891][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:35:11,392][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:35:11,897][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:35:12,401][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:35:12,908][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:35:13,417][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:35:13,937][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:35:14,438][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:35:14,944][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:35:15,452][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:35:15,957][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:35:16,463][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:35:16,968][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:35:17,475][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:35:17,982][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:35:18,486][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:35:18,991][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:35:19,495][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:35:20,001][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:35:20,510][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:35:21,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10512 tokens. [2025-11-13 02:35:21,731][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 02:35:22,502][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:35:22,503][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:35:22,505][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:35:24,319][__main__][INFO] - Iteration 301 took 55s (32.70% Gen, 64.02% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 38m 45s. Estimated total time: 46h 4m 28s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 8s, 500 more iterations: 7h 40m 44s. [2025-11-13 02:35:24,321][__main__][INFO] - Starting iteration 301. [2025-11-13 02:35:24,820][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 02:35:24,821][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:35:41,362][__main__][INFO] - Number of regex retries in iteration 301: 0 [2025-11-13 02:35:41,363][__main__][INFO] - agents played in iteration 301 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:35:42,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:35:42,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:35:42,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:35:42,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:35:42,271][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:35:42,271][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:35:43,019][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:35:43,481][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:35:43,994][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:35:44,501][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:35:45,007][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:35:45,519][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:35:46,023][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:35:46,525][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:35:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:35:47,539][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:35:48,058][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:35:48,563][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:35:49,068][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:35:49,571][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:35:50,075][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:35:50,582][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:35:51,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:35:51,590][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:35:52,096][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:35:52,601][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:35:53,107][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:35:53,607][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:35:54,110][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:35:54,612][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:35:55,115][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:35:55,615][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:35:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:35:56,618][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:35:57,138][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:35:57,640][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:35:58,144][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:35:58,647][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:35:59,148][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:35:59,651][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:36:00,158][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:36:00,664][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:36:01,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:36:01,676][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:36:02,176][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:36:02,675][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:36:03,173][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:36:03,680][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:36:04,180][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:36:04,684][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:36:05,187][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:36:05,686][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:36:06,189][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:36:06,691][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:36:07,195][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:36:07,699][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:36:08,205][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:36:08,706][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:36:09,219][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:36:09,718][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:36:10,222][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:36:10,725][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:36:11,226][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:36:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:36:12,248][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:36:12,753][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:36:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:36:13,767][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:36:14,295][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:36:14,805][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:36:15,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10572 tokens. [2025-11-13 02:36:16,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 02:36:16,824][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:36:16,826][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:36:16,827][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:36:17,737][__main__][INFO] - Iteration 302 took 52s (31.26% Gen, 67.02% Train). Generation: 16s, Training: 35s. Estimated remaining time: 39h 39m 15s. Estimated total time: 44h 5m 51s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 11s, 500 more iterations: 7h 20m 58s. [2025-11-13 02:36:17,739][__main__][INFO] - Starting iteration 302. [2025-11-13 02:36:18,223][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 02:36:18,224][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:36:36,132][__main__][INFO] - Number of regex retries in iteration 302: 0 [2025-11-13 02:36:36,132][__main__][INFO] - agents played in iteration 302 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:36:36,938][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:36:36,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:36:36,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:36:37,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:36:37,006][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:36:37,007][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:36:37,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:36:38,196][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:36:38,710][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:36:39,213][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:36:39,716][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:36:40,223][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:36:40,734][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:36:41,236][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:36:41,751][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:36:42,255][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:36:42,761][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:36:43,269][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:36:43,775][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:36:44,284][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:36:44,793][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:36:45,299][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:36:45,806][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:36:46,311][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:36:46,817][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:36:47,334][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:36:47,835][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:36:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:36:48,844][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:36:49,346][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:36:49,868][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:36:50,369][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:36:50,872][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:36:51,374][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:36:51,876][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:36:52,383][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:36:52,886][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:36:53,391][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:36:53,892][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:36:54,397][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:36:54,933][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:36:55,443][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:36:55,952][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:36:56,460][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:36:56,964][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:36:57,472][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:36:57,979][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:36:58,482][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:36:58,987][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:36:59,487][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:36:59,989][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:37:00,495][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:37:00,993][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:37:01,494][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:37:01,998][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:37:02,503][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:37:03,008][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:37:03,512][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:37:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:37:04,517][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:37:05,023][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:37:05,538][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:37:06,040][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:37:06,542][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:37:07,043][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:37:07,547][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:37:08,060][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:37:08,563][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:37:09,067][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:37:09,568][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:37:10,069][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10427 tokens. [2025-11-13 02:37:10,764][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 02:37:11,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:37:11,527][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:37:11,529][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:37:12,447][__main__][INFO] - Iteration 303 took 54s (33.03% Gen, 65.28% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 43m 41s. Estimated total time: 45h 11m 12s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 22s, 500 more iterations: 7h 31m 52s. [2025-11-13 02:37:12,449][__main__][INFO] - Starting iteration 303. [2025-11-13 02:37:12,927][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 02:37:12,928][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:37:18,480][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:37:18,743][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:37:29,060][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:37:29,233][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 1球 did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:37:30,182][__main__][INFO] - Number of regex retries in iteration 303: 4 [2025-11-13 02:37:30,183][__main__][INFO] - agents played in iteration 303 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:37:30,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:37:31,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:37:31,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:37:31,060][mllm.training.trainer_ad_align][INFO] - For task: Get 
advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:37:31,061][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:37:31,062][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 02:37:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:37:32,254][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:37:32,764][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:37:33,268][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:37:33,774][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:37:34,283][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:37:34,790][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:37:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:37:35,796][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:37:36,295][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:37:36,794][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:37:37,295][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:37:37,795][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:37:38,298][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:37:38,800][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:37:39,299][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:37:39,802][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 
02:37:40,305][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:37:40,809][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:37:41,313][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:37:41,812][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 02:37:42,315][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:37:42,817][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:37:43,316][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:37:43,827][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:37:44,330][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:37:44,836][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:37:45,336][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:37:45,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:37:46,343][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:37:46,848][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:37:47,350][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:37:47,854][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:37:48,359][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:37:48,859][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:37:49,365][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:37:49,871][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:37:50,374][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 
02:37:50,873][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:37:51,375][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:37:51,877][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:37:52,376][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 02:37:52,884][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:37:53,387][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:37:53,890][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:37:54,393][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:37:54,892][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:37:55,398][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:37:55,901][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:37:56,402][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:37:56,901][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:37:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:37:57,907][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:37:58,412][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:37:58,917][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:37:59,422][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:37:59,931][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:38:00,435][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:38:00,941][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 
02:38:01,447][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:38:01,972][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:38:02,478][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:38:02,986][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 02:38:03,490][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:38:03,996][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10557 tokens. [2025-11-13 02:38:04,740][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:32 [2025-11-13 02:38:05,498][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:38:05,500][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:38:05,502][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:38:06,350][__main__][INFO] - Iteration 304 took 53s (32.30% Gen, 66.11% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 2m 44s. Estimated total time: 44h 31m 9s. Time estimates for 10 more iterations: 8m 54s, 100 more iterations: 1h 29m 2s, 500 more iterations: 7h 25m 11s. [2025-11-13 02:38:06,353][__main__][INFO] - Starting iteration 304. 
[2025-11-13 02:38:06,838][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. [2025-11-13 02:38:06,839][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:38:12,005][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:38:25,122][__main__][INFO] - Number of regex retries in iteration 304: 1 [2025-11-13 02:38:25,123][__main__][INFO] - agents played in iteration 304 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:38:25,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:38:25,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:38:25,990][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:38:26,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:38:26,013][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:38:26,014][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:38:26,769][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:38:27,234][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:38:27,746][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:38:28,256][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:38:28,759][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:38:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:38:29,773][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:38:30,275][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:38:30,787][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:38:31,288][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:38:31,788][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:38:32,299][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:38:32,803][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:38:33,306][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:38:33,808][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:38:34,308][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:38:34,816][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:38:35,314][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:38:35,817][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:38:36,323][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:38:36,825][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:38:37,327][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:38:37,826][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:38:38,327][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:38:38,830][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:38:39,332][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:38:39,837][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:38:40,339][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:38:40,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:38:41,350][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:38:41,854][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:38:42,359][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:38:42,877][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:38:43,383][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:38:43,895][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:38:44,398][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:38:44,900][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:38:45,409][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:38:45,909][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:38:46,409][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:38:46,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:38:47,410][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:38:47,914][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:38:48,416][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:38:48,917][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:38:49,422][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:38:49,925][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:38:50,424][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:38:50,927][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:38:51,426][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:38:51,930][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:38:52,429][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:38:52,931][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:38:53,433][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:38:53,935][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:38:54,435][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:38:54,940][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:38:55,447][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:38:55,954][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:38:56,456][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:38:56,962][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:38:57,467][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:38:57,972][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:38:58,493][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:38:58,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10523 tokens. [2025-11-13 02:38:59,773][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 02:39:00,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:39:00,527][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:39:00,529][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:39:01,469][__main__][INFO] - Iteration 305 took 54s (33.47% Gen, 64.81% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 2m 15s. Estimated total time: 45h 31m 35s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 3s, 500 more iterations: 7h 35m 15s. [2025-11-13 02:39:01,471][__main__][INFO] - Starting iteration 305. [2025-11-13 02:39:01,939][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 02:39:01,939][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:39:20,240][__main__][INFO] - Number of regex retries in iteration 305: 0 [2025-11-13 02:39:20,240][__main__][INFO] - agents played in iteration 305 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:39:21,038][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:39:21,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:39:21,083][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:39:21,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:39:21,105][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:39:21,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:39:21,761][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:39:22,221][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:39:22,730][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:39:23,234][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:39:23,741][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:39:24,244][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:39:24,751][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:39:25,266][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:39:25,771][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:39:26,291][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:39:26,797][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:39:27,303][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:39:27,806][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:39:28,310][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:39:28,816][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:39:29,320][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:39:29,825][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:39:30,331][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:39:30,832][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:39:31,333][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:39:31,835][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:39:32,336][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:39:32,838][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:39:33,340][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:39:33,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:39:34,356][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:39:34,856][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:39:35,386][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:39:35,888][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:39:36,394][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:39:36,905][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:39:37,411][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:39:37,916][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:39:38,417][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:39:38,916][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:39:39,417][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:39:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:39:40,414][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:39:40,913][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:39:41,413][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:39:41,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:39:42,417][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:39:42,920][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:39:43,421][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:39:43,921][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:39:44,431][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:39:44,934][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:39:45,433][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:39:45,934][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:39:46,433][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:39:46,934][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:39:47,436][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:39:47,940][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:39:48,444][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:39:48,948][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:39:49,451][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:39:49,964][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:39:50,475][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:39:50,994][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:39:51,499][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:39:52,007][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:39:52,519][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:39:53,028][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:39:53,532][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:39:54,038][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10446 tokens. [2025-11-13 02:39:54,800][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 02:39:55,551][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:39:55,552][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:39:55,554][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:39:56,426][__main__][INFO] - Iteration 306 took 54s (33.59% Gen, 64.81% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 54m 8s. Estimated total time: 45h 24m 23s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 48s, 500 more iterations: 7h 34m 3s. [2025-11-13 02:39:56,428][__main__][INFO] - Starting iteration 306. [2025-11-13 02:39:56,896][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 02:39:56,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:40:01,215][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:40:07,989][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:40:14,881][__main__][INFO] - Number of regex retries in iteration 306: 2 [2025-11-13 02:40:14,882][__main__][INFO] - agents played in iteration 306 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:40:15,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:40:15,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:40:15,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:40:15,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:40:15,804][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:40:15,804][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:40:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:40:16,947][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:40:17,454][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:40:17,954][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:40:18,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:40:18,966][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:40:19,467][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:40:19,978][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:40:20,480][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:40:21,009][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:40:21,515][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:40:22,019][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:40:22,526][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:40:23,033][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:40:23,537][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:40:24,045][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:40:24,551][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:40:25,055][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:40:25,558][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:40:26,057][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:40:26,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:40:27,062][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:40:27,566][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:40:28,081][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:40:28,582][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:40:29,102][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:40:29,608][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:40:30,114][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:40:30,623][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:40:31,127][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:40:31,633][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:40:32,140][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:40:32,648][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:40:33,156][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:40:33,656][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:40:34,156][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:40:34,656][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:40:35,157][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:40:35,657][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:40:36,158][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:40:36,662][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:40:37,166][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:40:37,666][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:40:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:40:38,686][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:40:39,187][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:40:39,703][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:40:40,209][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:40:40,718][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:40:41,227][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:40:41,735][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:40:42,240][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:40:42,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:40:43,255][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:40:43,762][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:40:44,267][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:40:44,770][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:40:45,278][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:40:45,782][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:40:46,301][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:40:46,807][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:40:47,312][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:40:47,821][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:40:48,325][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:40:48,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10416 tokens. [2025-11-13 02:40:49,577][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.35%, ΔTime: 00:00:33 [2025-11-13 02:40:50,378][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:40:50,380][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:40:50,381][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:40:51,304][__main__][INFO] - Iteration 307 took 54s (33.05% Gen, 65.25% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 49m 16s. Estimated total time: 45h 20m 26s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 40s, 500 more iterations: 7h 33m 24s. [2025-11-13 02:40:51,306][__main__][INFO] - Starting iteration 307. [2025-11-13 02:40:51,787][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 02:40:51,787][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:40:56,525][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:41:02,669][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:41:09,926][__main__][INFO] - Number of regex retries in iteration 307: 2 [2025-11-13 02:41:09,927][__main__][INFO] - agents played in iteration 307 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:41:10,800][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:41:10,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:41:10,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:41:10,869][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:41:10,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:41:10,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:41:11,576][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:41:12,035][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:41:12,545][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:41:13,056][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:41:13,557][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:41:14,060][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:41:14,564][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:41:15,068][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:41:15,573][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:41:16,076][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:41:16,582][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:41:17,085][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:41:17,588][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:41:18,118][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:41:18,621][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:41:19,125][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:41:19,629][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:41:20,132][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:41:20,634][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:41:21,137][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:41:21,644][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:41:22,148][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:41:22,652][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:41:23,157][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:41:23,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:41:24,164][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:41:24,666][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:41:25,172][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:41:25,679][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:41:26,195][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:41:26,699][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:41:27,212][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:41:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:41:28,222][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:41:28,727][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:41:29,234][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:41:29,737][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:41:30,235][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:41:30,734][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:41:31,236][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:41:31,735][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:41:32,239][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:41:32,745][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:41:33,249][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:41:33,752][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:41:34,252][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:41:34,759][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:41:35,270][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:41:35,777][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:41:36,283][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:41:36,801][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:41:37,314][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:41:37,828][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:41:38,335][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:41:38,837][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:41:39,347][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:41:39,851][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:41:40,360][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:41:40,865][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:41:41,370][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:41:41,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:41:42,391][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:41:42,893][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:41:43,405][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:41:43,909][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10551 tokens. [2025-11-13 02:41:44,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 02:41:45,397][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:41:45,398][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:41:45,400][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:41:46,321][__main__][INFO] - Iteration 308 took 54s (33.26% Gen, 65.04% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 54m 40s. Estimated total time: 45h 26m 45s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 53s, 500 more iterations: 7h 34m 27s. [2025-11-13 02:41:46,323][__main__][INFO] - Starting iteration 308. [2025-11-13 02:41:46,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 02:41:46,831][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:41:51,995][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:42:03,854][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:42:05,722][__main__][INFO] - Number of regex retries in iteration 308: 2 [2025-11-13 02:42:05,723][__main__][INFO] - agents played in iteration 308 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:42:06,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:42:06,604][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:42:06,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:42:06,649][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:42:06,649][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:42:06,651][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:42:07,353][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:42:07,812][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:42:08,325][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:42:08,834][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:42:09,346][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:42:09,849][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:42:10,351][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:42:10,858][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:42:11,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:42:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:42:12,375][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:42:12,880][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:42:13,387][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:42:13,892][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:42:14,397][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:42:14,901][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:42:15,407][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:42:15,911][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:42:16,415][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:42:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:42:17,422][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:42:17,922][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:42:18,445][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:42:18,947][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:42:19,452][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:42:19,960][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:42:20,461][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:42:20,968][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:42:21,469][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:42:21,971][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:42:22,472][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:42:22,978][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:42:23,481][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:42:23,986][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:42:24,490][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:42:24,995][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:42:25,498][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:42:25,997][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:42:26,511][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:42:27,010][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:42:27,515][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:42:28,019][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:42:28,520][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:42:29,025][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:42:29,533][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:42:30,040][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:42:30,550][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:42:31,054][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:42:31,567][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:42:32,073][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:42:32,579][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:42:33,088][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:42:33,592][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:42:34,097][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:42:34,603][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:42:35,109][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:42:35,613][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:42:36,115][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:42:36,618][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:42:37,128][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:42:37,631][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:42:38,143][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:42:38,648][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:42:39,149][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:42:39,656][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10507 tokens. [2025-11-13 02:42:40,387][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 02:42:41,200][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:42:41,202][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:42:41,203][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:42:42,145][__main__][INFO] - Iteration 309 took 55s (34.15% Gen, 64.14% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 32m 44s. Estimated total time: 46h 5m 44s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 11s, 500 more iterations: 7h 40m 57s. [2025-11-13 02:42:42,147][__main__][INFO] - Starting iteration 309. [2025-11-13 02:42:42,632][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 02:42:42,632][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:42:49,411][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:43:00,875][__main__][INFO] - Number of regex retries in iteration 309: 1 [2025-11-13 02:43:00,876][__main__][INFO] - agents played in iteration 309 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:43:01,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:43:01,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:43:01,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:43:01,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:43:01,820][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:43:01,821][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:43:02,511][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:43:02,977][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:43:03,486][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:43:03,989][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:43:04,494][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:43:04,995][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:43:05,496][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:43:06,005][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:43:06,506][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:43:07,047][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:43:07,551][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:43:08,056][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:43:08,560][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:43:09,064][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:43:09,571][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:43:10,075][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:43:10,579][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:43:11,085][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:43:11,592][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:43:12,099][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:43:12,604][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:43:13,109][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:43:13,625][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:43:14,129][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:43:14,633][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:43:15,135][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:43:15,640][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:43:16,144][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:43:16,645][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:43:17,147][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:43:17,663][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:43:18,168][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:43:18,675][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:43:19,182][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:43:19,689][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:43:20,197][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:43:20,704][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:43:21,212][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:43:21,731][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:43:22,242][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:43:22,756][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:43:23,262][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:43:23,769][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:43:24,281][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:43:24,787][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:43:25,292][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:43:25,797][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:43:26,302][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:43:26,818][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:43:27,323][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:43:27,828][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:43:28,337][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:43:28,840][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:43:29,351][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:43:29,854][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:43:30,358][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:43:30,864][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:43:31,367][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:43:31,869][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:43:32,376][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:43:32,877][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:43:33,382][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:43:33,883][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:43:34,388][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:43:34,894][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10569 tokens. [2025-11-13 02:43:35,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 02:43:36,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:43:36,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:43:36,423][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:43:37,344][__main__][INFO] - Iteration 310 took 54s (33.35% Gen, 64.97% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 1m 42s. Estimated total time: 45h 35m 38s. Time estimates for 10 more iterations: 9m 7s, 100 more iterations: 1h 31m 11s, 500 more iterations: 7h 35m 56s. [2025-11-13 02:43:37,346][__main__][INFO] - Starting iteration 310. [2025-11-13 02:43:37,814][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 30 and human policies 1. 
[2025-11-13 02:43:37,815][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:43:42,778][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:43:56,328][__main__][INFO] - Number of regex retries in iteration 310: 1 [2025-11-13 02:43:56,329][__main__][INFO] - agents played in iteration 310 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:43:57,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:43:57,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:43:57,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:43:57,239][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:43:57,240][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:43:57,241][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:43:57,946][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:43:58,411][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:43:58,921][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:43:59,426][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:43:59,935][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:44:00,437][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:44:00,949][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:44:01,453][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:44:01,957][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:44:02,467][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:44:02,970][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:44:03,470][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:44:03,971][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:44:04,475][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:44:04,981][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:44:05,486][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:44:05,989][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:44:06,492][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:44:06,995][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:44:07,504][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:44:08,007][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:44:08,511][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:44:09,026][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:44:09,529][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:44:10,043][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:44:10,548][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:44:11,052][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:44:11,559][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:44:12,066][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:44:12,574][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:44:13,081][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:44:13,588][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:44:14,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:44:14,601][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:44:15,107][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:44:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:44:16,120][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:44:16,627][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:44:17,133][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:44:17,636][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:44:18,146][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:44:18,653][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:44:19,160][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:44:19,667][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:44:20,172][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:44:20,679][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:44:21,181][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:44:21,687][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:44:22,192][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:44:22,697][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:44:23,203][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:44:23,704][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:44:24,206][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:44:24,709][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:44:25,210][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:44:25,714][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:44:26,215][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:44:26,721][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:44:27,230][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:44:27,733][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:44:28,234][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:44:28,750][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:44:29,255][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:44:29,768][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:44:30,273][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10571 tokens. [2025-11-13 02:44:31,046][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 02:44:31,847][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:44:31,848][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:44:31,850][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:44:33,675][__main__][INFO] - Iteration 311 took 55s (33.14% Gen, 63.59% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 58m 12s. Estimated total time: 46h 33m 4s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 6s, 500 more iterations: 7h 45m 30s. [2025-11-13 02:44:33,678][__main__][INFO] - Starting iteration 311. [2025-11-13 02:44:34,161][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. 
[2025-11-13 02:44:34,161][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:44:39,284][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:44:39,933][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:44:50,456][__main__][INFO] - Number of regex retries in iteration 311: 2 [2025-11-13 02:44:50,457][__main__][INFO] - agents played in iteration 311 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:44:51,255][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:44:51,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:44:51,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:44:51,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:44:51,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:44:51,329][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:44:52,044][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:44:52,503][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:44:53,013][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:44:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:44:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:44:54,532][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:44:55,034][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:44:55,535][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:44:56,042][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:44:56,541][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:44:57,040][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:44:57,548][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:44:58,051][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:44:58,552][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:44:59,052][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:44:59,552][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:45:00,072][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:45:00,575][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:45:01,077][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:45:01,587][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:45:02,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:45:02,618][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:45:03,122][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:45:03,628][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:45:04,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:45:04,638][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:45:05,144][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:45:05,663][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:45:06,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:45:06,689][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:45:07,196][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:45:07,698][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:45:08,207][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:45:08,715][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:45:09,223][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:45:09,729][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:45:10,237][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:45:10,753][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:45:11,260][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:45:11,775][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:45:12,281][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:45:12,782][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:45:13,286][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:45:13,792][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:45:14,296][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:45:14,800][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:45:15,306][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:45:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:45:16,318][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:45:16,824][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:45:17,330][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:45:17,835][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:45:18,341][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:45:18,847][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:45:19,353][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:45:19,861][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:45:20,364][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:45:20,870][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:45:21,374][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:45:21,882][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:45:22,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:45:22,892][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:45:23,397][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:45:23,903][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:45:24,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10586 tokens. [2025-11-13 02:45:25,179][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 02:45:25,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:45:25,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:45:25,967][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:45:26,914][__main__][INFO] - Iteration 312 took 52s (30.89% Gen, 67.31% Train). Generation: 16s, Training: 35s. Estimated remaining time: 39h 21m 56s. Estimated total time: 43h 57m 41s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 55s, 500 more iterations: 7h 19m 36s. [2025-11-13 02:45:26,916][__main__][INFO] - Starting iteration 312. [2025-11-13 02:45:27,405][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. 
[2025-11-13 02:45:27,406][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:45:45,021][__main__][INFO] - Number of regex retries in iteration 312: 0 [2025-11-13 02:45:45,021][__main__][INFO] - agents played in iteration 312 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:45:45,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:45:45,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:45:45,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:45:45,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:45:45,957][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:45:45,957][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:45:46,663][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:45:47,124][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:45:47,634][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:45:48,137][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:45:48,646][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:45:49,148][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:45:49,651][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:45:50,153][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:45:50,656][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:45:51,174][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:45:51,678][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:45:52,179][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:45:52,686][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:45:53,190][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:45:53,695][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:45:54,200][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:45:54,704][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:45:55,215][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:45:55,720][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:45:56,223][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:45:56,728][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:45:57,231][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:45:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:45:58,239][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:45:58,747][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:45:59,267][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:45:59,776][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:46:00,290][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:46:00,796][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:46:01,302][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:46:01,812][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:46:02,318][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:46:02,825][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:46:03,327][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:46:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:46:04,340][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:46:04,847][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:46:05,353][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:46:05,869][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:46:06,372][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:46:06,883][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:46:07,393][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:46:07,898][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:46:08,406][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:46:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:46:09,436][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:46:09,942][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:46:10,452][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:46:10,962][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:46:11,467][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:46:11,969][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:46:12,474][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:46:12,979][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:46:13,495][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:46:13,995][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:46:14,499][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:46:15,005][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:46:15,509][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:46:16,014][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:46:16,521][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:46:17,026][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:46:17,533][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:46:18,037][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:46:18,542][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:46:19,046][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10641 tokens. [2025-11-13 02:46:19,739][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 02:46:20,548][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:46:20,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:46:20,551][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:46:21,490][__main__][INFO] - Iteration 313 took 54s (32.57% Gen, 65.69% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 27m 37s. Estimated total time: 45h 4m 16s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 8s, 500 more iterations: 7h 30m 42s. [2025-11-13 02:46:21,493][__main__][INFO] - Starting iteration 313. [2025-11-13 02:46:21,984][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. 
[2025-11-13 02:46:21,985][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:46:32,477][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:46:39,395][__main__][INFO] - Number of regex retries in iteration 313: 1 [2025-11-13 02:46:39,396][__main__][INFO] - agents played in iteration 313 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:46:40,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:46:40,231][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:46:40,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:46:40,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:46:40,276][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:46:40,276][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:46:40,952][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:46:41,416][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:46:41,923][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:46:42,426][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:46:42,930][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:46:43,434][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:46:43,936][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:46:44,434][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:46:44,937][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:46:45,449][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:46:45,950][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:46:46,451][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:46:46,960][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:46:47,465][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:46:47,970][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:46:48,473][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:46:48,975][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:46:49,480][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:46:49,985][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:46:50,488][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:46:50,996][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:46:51,507][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:46:52,021][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:46:52,526][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:46:53,032][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:46:53,538][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:46:54,040][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:46:54,549][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:46:55,054][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:46:55,557][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:46:56,066][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:46:56,573][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:46:57,082][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:46:57,591][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:46:58,096][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:46:58,607][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:46:59,113][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:46:59,621][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:47:00,128][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:47:00,635][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:47:01,142][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:47:01,648][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:47:02,151][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:47:02,652][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:47:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:47:03,665][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:47:04,167][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:47:04,675][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:47:05,184][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:47:05,689][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:47:06,195][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:47:06,699][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:47:07,202][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:47:07,717][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:47:08,220][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:47:08,727][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:47:09,226][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:47:09,731][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:47:10,241][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:47:10,745][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:47:11,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:47:11,755][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:47:12,258][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:47:12,760][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:47:13,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10486 tokens. [2025-11-13 02:47:13,989][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 02:47:14,782][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:47:14,784][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:47:14,785][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:47:15,730][__main__][INFO] - Iteration 314 took 53s (32.40% Gen, 65.85% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 9m 44s. Estimated total time: 44h 47m 18s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 34s, 500 more iterations: 7h 27m 53s. [2025-11-13 02:47:15,732][__main__][INFO] - Starting iteration 314. [2025-11-13 02:47:16,219][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. 
[2025-11-13 02:47:16,220][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:47:23,302][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:47:35,246][__main__][INFO] - Number of regex retries in iteration 314: 1 [2025-11-13 02:47:35,246][__main__][INFO] - agents played in iteration 314 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:47:36,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:47:36,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:47:36,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:47:36,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:47:36,388][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:47:36,388][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:47:37,067][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:47:37,527][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:47:38,039][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:47:38,544][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:47:39,048][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:47:39,550][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:47:40,052][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:47:40,560][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:47:41,067][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:47:41,568][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:47:42,070][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:47:42,576][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:47:43,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:47:43,585][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:47:44,092][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:47:44,611][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:47:45,116][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:47:45,621][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:47:46,122][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:47:46,627][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:47:47,132][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:47:47,638][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:47:48,144][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:47:48,653][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:47:49,155][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:47:49,663][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:47:50,170][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:47:50,672][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:47:51,178][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:47:51,681][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:47:52,183][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:47:52,699][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:47:53,201][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:47:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:47:54,214][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:47:54,719][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:47:55,236][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:47:55,741][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:47:56,244][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:47:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:47:57,255][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:47:57,764][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:47:58,269][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:47:58,774][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:47:59,289][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:47:59,796][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:48:00,304][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:48:00,810][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:48:01,314][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:48:01,820][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:48:02,325][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:48:02,832][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:48:03,331][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:48:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:48:04,339][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:48:04,843][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:48:05,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:48:05,852][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:48:06,354][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:48:06,853][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:48:07,354][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:48:07,853][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:48:08,358][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:48:08,862][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:48:09,363][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10423 tokens. [2025-11-13 02:48:10,068][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 02:48:10,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:48:10,883][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:48:10,885][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:48:11,810][__main__][INFO] - Iteration 315 took 55s (34.23% Gen, 64.11% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 41m 2s. Estimated total time: 46h 19m 32s. Time estimates for 10 more iterations: 9m 15s, 100 more iterations: 1h 32m 39s, 500 more iterations: 7h 43m 15s. [2025-11-13 02:48:11,812][__main__][INFO] - Starting iteration 315. [2025-11-13 02:48:12,295][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. 
[2025-11-13 02:48:12,296][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:48:17,202][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:48:30,747][__main__][INFO] - Number of regex retries in iteration 315: 1 [2025-11-13 02:48:30,748][__main__][INFO] - agents played in iteration 315 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:48:31,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:48:31,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:48:31,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:48:31,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:48:31,662][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:48:31,663][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:48:32,337][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:48:32,795][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:48:33,308][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:48:33,812][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:48:34,313][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:48:34,826][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:48:35,330][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:48:35,842][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:48:36,344][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:48:36,854][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:48:37,358][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:48:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:48:38,369][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:48:38,875][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:48:39,380][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:48:39,890][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:48:40,394][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:48:40,906][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:48:41,415][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:48:41,922][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:48:42,437][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:48:42,942][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:48:43,448][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:48:43,954][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:48:44,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:48:44,969][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:48:45,473][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:48:45,975][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:48:46,479][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:48:46,986][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:48:47,489][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:48:47,993][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:48:48,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:48:49,022][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:48:49,530][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:48:50,041][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:48:50,547][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:48:51,052][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:48:51,562][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:48:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:48:52,578][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:48:53,088][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:48:53,594][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:48:54,096][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:48:54,597][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:48:55,100][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:48:55,624][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:48:56,129][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:48:56,630][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:48:57,132][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:48:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:48:58,138][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:48:58,637][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:48:59,144][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:48:59,649][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:49:00,152][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:49:00,657][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:49:01,158][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:49:01,662][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:49:02,160][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:49:02,663][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:49:03,164][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:49:03,665][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:49:04,166][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:49:04,673][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10415 tokens. [2025-11-13 02:49:05,378][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 02:49:06,169][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:49:06,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:49:06,172][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:49:07,132][__main__][INFO] - Iteration 316 took 54s (33.65% Gen, 64.60% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 2m 27s. Estimated total time: 45h 41m 52s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 23s, 500 more iterations: 7h 36m 58s. [2025-11-13 02:49:07,134][__main__][INFO] - Starting iteration 316. [2025-11-13 02:49:07,618][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. 
[2025-11-13 02:49:07,619][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:49:26,196][__main__][INFO] - Number of regex retries in iteration 316: 0 [2025-11-13 02:49:26,197][__main__][INFO] - agents played in iteration 316 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:49:27,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:49:27,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:49:27,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:49:27,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:49:27,152][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:49:27,153][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:49:27,876][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:49:28,338][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:49:28,855][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:49:29,365][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:49:29,869][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:49:30,372][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:49:30,877][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:49:31,381][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:49:31,888][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:49:32,392][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:49:32,900][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:49:33,405][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:49:33,909][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:49:34,408][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:49:34,917][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:49:35,419][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:49:35,931][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:49:36,435][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:49:36,976][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:49:37,481][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:49:37,986][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:49:38,493][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:49:39,000][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:49:39,503][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:49:40,011][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:49:40,515][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:49:41,022][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:49:41,527][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:49:42,031][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:49:42,542][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:49:43,044][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:49:43,553][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:49:44,058][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:49:44,560][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:49:45,068][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:49:45,574][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:49:46,078][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:49:46,613][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:49:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:49:47,618][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:49:48,125][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:49:48,629][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:49:49,151][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:49:49,658][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:49:50,166][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:49:50,673][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:49:51,174][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:49:51,676][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:49:52,177][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:49:52,682][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:49:53,191][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:49:53,696][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:49:54,201][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:49:54,702][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:49:55,209][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:49:55,728][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:49:56,232][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:49:56,736][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:49:57,236][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:49:57,737][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:49:58,248][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:49:58,754][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:49:59,258][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:49:59,761][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:50:00,261][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10414 tokens. [2025-11-13 02:50:00,955][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 02:50:01,728][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:50:01,730][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:50:01,731][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:50:02,662][__main__][INFO] - Iteration 317 took 55s (33.75% Gen, 64.56% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 11m 52s. Estimated total time: 45h 52m 13s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 44s, 500 more iterations: 7h 38m 42s. [2025-11-13 02:50:02,664][__main__][INFO] - Starting iteration 317. [2025-11-13 02:50:03,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. 
[2025-11-13 02:50:03,136][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:50:20,846][__main__][INFO] - Number of regex retries in iteration 317: 0 [2025-11-13 02:50:20,847][__main__][INFO] - agents played in iteration 317 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:50:21,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:50:21,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:50:21,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:50:21,717][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:50:21,717][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:50:21,718][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:50:22,449][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:50:22,906][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:50:23,423][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:50:23,926][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:50:24,429][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:50:24,938][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:50:25,445][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:50:25,953][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:50:26,461][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:50:26,966][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:50:27,480][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:50:27,985][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:50:28,491][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:50:29,000][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:50:29,507][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:50:30,015][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:50:30,522][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:50:31,030][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:50:31,540][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:50:32,068][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:50:32,583][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:50:33,091][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:50:33,596][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:50:34,096][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:50:34,604][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:50:35,113][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:50:35,617][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:50:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:50:36,631][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:50:37,138][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:50:37,643][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:50:38,150][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:50:38,653][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:50:39,165][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:50:39,670][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:50:40,175][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:50:40,679][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:50:41,184][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:50:41,694][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:50:42,197][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:50:42,701][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:50:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:50:43,713][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:50:44,217][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:50:44,721][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:50:45,225][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:50:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:50:46,245][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:50:46,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:50:47,252][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:50:47,755][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:50:48,260][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:50:48,766][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:50:49,268][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:50:49,773][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:50:50,277][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:50:50,780][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:50:51,287][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:50:51,791][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:50:52,298][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:50:52,801][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:50:53,304][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:50:53,814][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:50:54,315][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:50:54,828][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10624 tokens. [2025-11-13 02:50:55,537][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 02:50:56,340][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:50:56,341][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:50:56,343][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:50:57,326][__main__][INFO] - Iteration 318 took 54s (32.68% Gen, 65.50% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 28m 17s. Estimated total time: 45h 9m 33s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 19s, 500 more iterations: 7h 31m 35s. [2025-11-13 02:50:57,328][__main__][INFO] - Starting iteration 318. [2025-11-13 02:50:57,797][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. 
[2025-11-13 02:50:57,797][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:51:11,469][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:51:12,746][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:51:15,284][__main__][INFO] - Number of regex retries in iteration 318: 2 [2025-11-13 02:51:15,284][__main__][INFO] - agents played in iteration 318 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:51:16,085][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:51:16,108][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:51:16,130][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:51:16,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:51:16,154][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:51:16,155][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:51:16,879][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:51:17,343][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:51:17,853][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:51:18,370][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:51:18,878][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:51:19,380][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:51:19,886][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:51:20,394][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:51:20,902][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:51:21,405][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:51:21,910][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:51:22,424][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:51:22,930][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:51:23,433][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:51:23,948][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:51:24,459][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:51:24,971][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:51:25,480][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:51:25,990][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:51:26,506][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:51:27,013][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:51:27,521][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:51:28,032][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:51:28,537][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:51:29,045][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:51:29,552][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:51:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:51:30,569][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:51:31,076][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:51:31,585][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:51:32,089][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:51:32,597][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:51:33,100][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:51:33,603][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:51:34,105][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:51:34,606][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:51:35,110][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:51:35,622][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:51:36,125][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:51:36,625][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:51:37,130][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:51:37,633][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:51:38,154][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:51:38,657][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:51:39,162][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:51:39,663][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:51:40,166][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:51:40,673][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:51:41,176][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:51:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:51:42,182][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:51:42,683][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:51:43,185][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:51:43,689][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:51:44,191][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:51:44,692][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:51:45,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:51:45,697][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:51:46,203][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:51:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:51:47,217][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:51:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:51:48,225][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:51:48,737][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:51:49,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10460 tokens. [2025-11-13 02:51:49,961][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.37%, ΔTime: 00:00:33 [2025-11-13 02:51:50,748][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:51:50,750][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:51:50,752][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:51:51,750][__main__][INFO] - Iteration 319 took 53s (32.41% Gen, 65.74% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 15m 32s. Estimated total time: 44h 57m 42s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 55s, 500 more iterations: 7h 29m 37s. [2025-11-13 02:51:51,752][__main__][INFO] - Starting iteration 319. [2025-11-13 02:51:52,226][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. 
[2025-11-13 02:51:52,227][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:51:56,504][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:52:06,099][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 20 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:52:09,956][__main__][INFO] - Number of regex retries in iteration 319: 2 [2025-11-13 02:52:09,957][__main__][INFO] - agents played in iteration 319 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:52:10,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:52:10,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:52:10,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:52:10,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:52:10,873][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:52:10,874][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:52:11,562][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:52:12,045][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:52:12,549][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:52:13,050][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:52:13,554][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:52:14,056][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:52:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:52:15,071][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:52:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:52:16,085][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:52:16,590][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:52:17,093][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:52:17,601][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:52:18,104][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:52:18,604][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:52:19,112][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:52:19,617][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:52:20,119][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:52:20,624][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:52:21,132][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:52:21,639][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:52:22,153][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:52:22,664][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:52:23,168][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:52:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:52:24,181][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:52:24,685][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:52:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:52:25,691][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:52:26,200][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:52:26,702][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:52:27,206][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:52:27,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:52:28,245][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:52:28,754][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:52:29,261][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:52:29,768][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:52:30,274][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:52:30,781][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:52:31,287][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:52:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:52:32,291][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:52:32,793][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:52:33,298][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:52:33,800][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:52:34,303][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:52:34,808][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:52:35,309][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:52:35,814][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:52:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:52:36,816][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:52:37,322][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:52:37,824][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:52:38,325][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:52:38,829][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:52:39,331][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:52:39,849][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:52:40,357][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:52:40,862][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:52:41,368][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:52:41,869][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:52:42,378][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:52:42,882][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:52:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:52:43,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10416 tokens. [2025-11-13 02:52:44,615][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 02:52:45,417][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:52:45,419][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:52:45,420][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:52:46,327][__main__][INFO] - Iteration 320 took 54s (32.77% Gen, 65.55% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 22m 0s. Estimated total time: 45h 5m 4s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 10s, 500 more iterations: 7h 30m 50s. [2025-11-13 02:52:46,329][__main__][INFO] - Starting iteration 320. [2025-11-13 02:52:46,825][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 31 and human policies 1. 
[2025-11-13 02:52:46,825][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:52:52,423][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:53:05,313][__main__][INFO] - Number of regex retries in iteration 320: 1 [2025-11-13 02:53:05,314][__main__][INFO] - agents played in iteration 320 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:53:06,114][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:53:06,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:53:06,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:53:06,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:53:06,192][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:53:06,193][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:53:06,891][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:53:07,358][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:53:07,869][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:53:08,375][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:53:08,884][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:53:09,393][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:53:09,901][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:53:10,405][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:53:10,911][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:53:11,413][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:53:11,923][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:53:12,432][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:53:12,942][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:53:13,450][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:53:13,966][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:53:14,475][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:53:14,989][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:53:15,493][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:53:15,998][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:53:16,509][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:53:17,016][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:53:17,524][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:53:18,028][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:53:18,535][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:53:19,049][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:53:19,555][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:53:20,060][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:53:20,563][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:53:21,068][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:53:21,580][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:53:22,080][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:53:22,582][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:53:23,088][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:53:23,594][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:53:24,098][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:53:24,606][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:53:25,110][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:53:25,617][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:53:26,125][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:53:26,629][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:53:27,131][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:53:27,635][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:53:28,141][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:53:28,643][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:53:29,148][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:53:29,668][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:53:30,171][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:53:30,677][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:53:31,181][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:53:31,686][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:53:32,201][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:53:32,709][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:53:33,214][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:53:33,718][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:53:34,224][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:53:34,730][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:53:35,238][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:53:35,744][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:53:36,248][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:53:36,748][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:53:37,252][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:53:37,768][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:53:38,270][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:53:38,782][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:53:39,286][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10458 tokens. [2025-11-13 02:53:39,978][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.37%, ΔTime: 00:00:33 [2025-11-13 02:53:40,772][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:53:40,773][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:53:40,775][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:53:42,655][__main__][INFO] - Iteration 321 took 55s (33.11% Gen, 63.52% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 47m 31s. Estimated total time: 46h 31m 32s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 3s, 500 more iterations: 7h 45m 15s. [2025-11-13 02:53:42,657][__main__][INFO] - Starting iteration 321. [2025-11-13 02:53:43,134][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 02:53:43,134][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:53:57,394][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:54:01,237][__main__][INFO] - Number of regex retries in iteration 321: 1 [2025-11-13 02:54:01,238][__main__][INFO] - agents played in iteration 321 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:54:02,084][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:54:02,111][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:54:02,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:54:02,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:54:02,161][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:54:02,162][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:54:02,871][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:54:03,334][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:54:03,842][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:54:04,350][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:54:04,856][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:54:05,368][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:54:05,880][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:54:06,388][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:54:06,906][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:54:07,417][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:54:07,923][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:54:08,428][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:54:08,935][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:54:09,446][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:54:09,952][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:54:10,461][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:54:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:54:11,482][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:54:11,997][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:54:12,502][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:54:13,008][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:54:13,514][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:54:14,019][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:54:14,527][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:54:15,033][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:54:15,534][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:54:16,033][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:54:16,533][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:54:17,038][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:54:17,537][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:54:18,037][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:54:18,540][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:54:19,046][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:54:19,549][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:54:20,054][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:54:20,563][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:54:21,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:54:21,573][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:54:22,080][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:54:22,600][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:54:23,107][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:54:23,611][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:54:24,113][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:54:24,620][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:54:25,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:54:25,632][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:54:26,136][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:54:26,641][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:54:27,149][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:54:27,650][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:54:28,153][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:54:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:54:29,157][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:54:29,661][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:54:30,162][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:54:30,677][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:54:31,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:54:31,693][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:54:32,193][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:54:32,697][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:54:33,209][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:54:33,713][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:54:34,219][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:54:34,723][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:54:35,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10413 tokens. [2025-11-13 02:54:35,951][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 02:54:36,736][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:54:36,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:54:36,740][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:54:37,657][__main__][INFO] - Iteration 322 took 54s (33.20% Gen, 65.11% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 41m 16s. Estimated total time: 45h 26m 12s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 52s, 500 more iterations: 7h 34m 22s. [2025-11-13 02:54:37,660][__main__][INFO] - Starting iteration 322. [2025-11-13 02:54:38,131][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 02:54:38,132][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:54:50,496][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:54:56,187][mllm.models.large_language_model_local][WARNING] - Response Given the per-item values and knowing Bob's per-item values, a strategic proposal would be to maximize the quantity of items I value highly while also considering the proportional allocation rule. Here’s the reasoning: 1. **Hats (Value: 10, Bob’s Value: 1)**: I should propose to take all 10 hats because this is my highest-valued resource. 2. **Books (Value: 1, Bob’s Value: 10)**: I should propose to take a very small number of books because they have a low value for me. Since most of the books will go to Bob, proposing a very small number minimizes my loss. 3. **Balls (Value: 10, Bob’s Value: 10)**: I should propose to take a fair share of the balls. Since we each value balls highly, it’s better to propose to take 5 balls each, which maximizes the expected outcome while avoiding the over-proposal rule. 
Given these considerations, my proposal would be: **Proposal: 10 hats, 0 books, 5 balls** did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:54:58,657][__main__][INFO] - Number of regex retries in iteration 322: 2 [2025-11-13 02:54:58,657][__main__][INFO] - agents played in iteration 322 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:54:59,507][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:54:59,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:54:59,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:54:59,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:54:59,585][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:54:59,586][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:55:00,321][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:55:00,779][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:55:01,292][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:55:01,801][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:55:02,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:55:02,805][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:55:03,311][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:55:03,816][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:55:04,325][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:55:04,828][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:55:05,335][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:55:05,839][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:55:06,339][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:55:06,847][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:55:07,349][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:55:07,852][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:55:08,371][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:55:08,874][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:55:09,373][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:55:09,879][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:55:10,377][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:55:10,885][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:55:11,388][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:55:11,888][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:55:12,396][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:55:12,904][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:55:13,410][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:55:13,913][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:55:14,417][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:55:14,921][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:55:15,427][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:55:15,931][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:55:16,436][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:55:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:55:17,441][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:55:17,945][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:55:18,446][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:55:18,956][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:55:19,458][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:55:19,960][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:55:20,479][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:55:20,980][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:55:21,493][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:55:21,993][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:55:22,493][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:55:22,996][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:55:23,498][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:55:23,996][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:55:24,494][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:55:24,995][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:55:25,498][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:55:25,999][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:55:26,499][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:55:27,001][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:55:27,505][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:55:28,006][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:55:28,507][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:55:29,008][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:55:29,512][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:55:30,011][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:55:30,513][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:55:31,013][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:55:31,512][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:55:32,016][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:55:32,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10393 tokens. [2025-11-13 02:55:33,238][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:32 [2025-11-13 02:55:34,011][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:55:34,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:55:34,014][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:55:34,962][__main__][INFO] - Iteration 323 took 56s (36.11% Gen, 62.21% Train). Generation: 20s, Training: 35s. Estimated remaining time: 42h 35m 40s. Estimated total time: 47h 21m 33s. Time estimates for 10 more iterations: 9m 28s, 100 more iterations: 1h 34m 43s, 500 more iterations: 7h 53m 35s. [2025-11-13 02:55:34,965][__main__][INFO] - Starting iteration 323. [2025-11-13 02:55:35,724][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 02:55:35,725][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:55:54,293][__main__][INFO] - Number of regex retries in iteration 323: 0 [2025-11-13 02:55:54,293][__main__][INFO] - agents played in iteration 323 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:55:55,133][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:55:55,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:55:55,179][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:55:55,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:55:55,202][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:55:55,203][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:55:56,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:55:56,499][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:55:57,012][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:55:57,523][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:55:58,035][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:55:58,549][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:55:59,057][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:55:59,563][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:56:00,075][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:56:00,583][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:56:01,111][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:56:01,617][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:56:02,125][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:56:02,639][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:56:03,145][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:56:03,650][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:56:04,155][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:56:04,658][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:56:05,162][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:56:05,663][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:56:06,164][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:56:06,673][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:56:07,175][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:56:07,679][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:56:08,181][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:56:08,686][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:56:09,200][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:56:09,708][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:56:10,212][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:56:10,717][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:56:11,220][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:56:11,727][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:56:12,232][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:56:12,732][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:56:13,238][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:56:13,742][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:56:14,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:56:14,752][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:56:15,257][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:56:15,759][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:56:16,267][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:56:16,773][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:56:17,290][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:56:17,797][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:56:18,311][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:56:18,812][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:56:19,341][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:56:19,844][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:56:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:56:20,854][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:56:21,357][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:56:21,860][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:56:22,367][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:56:22,873][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:56:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:56:23,884][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:56:24,388][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:56:24,898][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:56:25,397][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:56:25,896][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:56:26,410][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:56:26,914][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:56:27,416][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:56:27,921][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:56:28,425][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10409 tokens. [2025-11-13 02:56:29,150][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 02:56:29,913][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:56:29,914][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:56:29,918][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:56:30,924][__main__][INFO] - Iteration 324 took 55s (33.64% Gen, 64.54% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 13m 12s. Estimated total time: 46h 0m 1s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 0s, 500 more iterations: 7h 40m 0s. [2025-11-13 02:56:30,926][__main__][INFO] - Starting iteration 324. [2025-11-13 02:56:31,404][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 02:56:31,405][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:56:49,745][__main__][INFO] - Number of regex retries in iteration 324: 0 [2025-11-13 02:56:49,746][__main__][INFO] - agents played in iteration 324 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:56:50,631][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:56:50,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:56:50,682][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:56:50,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:56:50,706][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:56:50,707][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:56:51,460][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:56:51,919][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:56:52,435][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:56:52,950][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:56:53,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:56:53,959][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:56:54,460][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:56:54,964][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:56:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:56:55,984][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:56:56,486][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:56:56,997][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:56:57,498][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:56:58,011][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:56:58,517][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:56:59,021][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:56:59,528][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:57:00,031][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:57:00,532][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:57:01,033][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:57:01,536][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:57:02,043][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:57:02,547][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:57:03,049][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:57:03,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:57:04,052][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:57:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:57:05,062][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:57:05,567][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:57:06,077][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:57:06,582][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:57:07,091][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:57:07,595][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:57:08,099][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:57:08,619][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:57:09,124][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:57:09,632][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:57:10,134][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:57:10,637][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:57:11,142][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:57:11,648][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:57:12,153][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:57:12,658][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:57:13,159][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:57:13,662][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:57:14,162][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:57:14,666][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:57:15,166][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:57:15,669][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:57:16,170][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:57:16,674][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:57:17,175][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:57:17,679][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:57:18,184][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:57:18,686][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:57:19,189][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:57:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:57:20,222][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:57:20,727][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:57:21,231][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:57:21,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:57:22,243][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:57:22,750][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:57:23,256][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:57:23,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10497 tokens. [2025-11-13 02:57:24,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.37%, ΔTime: 00:00:33 [2025-11-13 02:57:25,339][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:57:25,341][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:57:25,343][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:57:26,205][__main__][INFO] - Iteration 325 took 54s (33.47% Gen, 64.96% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 52m 20s. Estimated total time: 45h 40m 4s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 20s, 500 more iterations: 7h 36m 40s. [2025-11-13 02:57:26,207][__main__][INFO] - Starting iteration 325. [2025-11-13 02:57:26,691][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 02:57:26,692][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:57:30,967][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:57:41,127][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 11 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 02:57:44,750][__main__][INFO] - Number of regex retries in iteration 325: 2 [2025-11-13 02:57:44,751][__main__][INFO] - agents played in iteration 325 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:57:45,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:57:45,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:57:45,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:57:45,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:57:45,719][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:57:45,720][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:57:46,610][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:57:47,074][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:57:47,589][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:57:48,107][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:57:48,610][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:57:49,125][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:57:49,631][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:57:50,133][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:57:50,648][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:57:51,150][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:57:51,653][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:57:52,157][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:57:52,661][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:57:53,166][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:57:53,666][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:57:54,166][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:57:54,669][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:57:55,171][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:57:55,673][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:57:56,180][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:57:56,682][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:57:57,185][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:57:57,686][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:57:58,189][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:57:58,691][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:57:59,192][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:57:59,693][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:58:00,194][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:58:00,695][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:58:01,210][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:58:01,712][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:58:02,211][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:58:02,714][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:58:03,218][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:58:03,732][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:58:04,235][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:58:04,737][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:58:05,247][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:58:05,752][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:58:06,258][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:58:06,761][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:58:07,265][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:58:07,768][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:58:08,269][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:58:08,771][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:58:09,272][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:58:09,774][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:58:10,274][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:58:10,777][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:58:11,283][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:58:11,789][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:58:12,294][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:58:12,799][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:58:13,315][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:58:13,817][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:58:14,338][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:58:14,841][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:58:15,343][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:58:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:58:16,353][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:58:16,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:58:17,362][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:58:17,863][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:58:18,367][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:58:18,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10256 tokens. [2025-11-13 02:58:19,647][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 02:58:20,424][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:58:20,426][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:58:20,427][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:58:21,339][__main__][INFO] - Iteration 326 took 54s (33.05% Gen, 65.28% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 43m 44s. Estimated total time: 45h 32m 24s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 4s, 500 more iterations: 7h 35m 24s. [2025-11-13 02:58:21,341][__main__][INFO] - Starting iteration 326. [2025-11-13 02:58:21,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 02:58:21,811][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:58:41,616][__main__][INFO] - Number of regex retries in iteration 326: 0 [2025-11-13 02:58:41,617][__main__][INFO] - agents played in iteration 326 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:58:42,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:58:42,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:58:42,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:58:42,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:58:42,573][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:58:42,573][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:58:43,280][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:58:43,744][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:58:44,254][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:58:44,762][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:58:45,268][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:58:45,775][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:58:46,278][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:58:46,786][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:58:47,291][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:58:47,792][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:58:48,294][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:58:48,797][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:58:49,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:58:49,809][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:58:50,309][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:58:50,811][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:58:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:58:51,815][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:58:52,328][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:58:52,831][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:58:53,337][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:58:53,845][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:58:54,349][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:58:54,856][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:58:55,359][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:58:55,865][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:58:56,369][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:58:56,873][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:58:57,377][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:58:57,881][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:58:58,390][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:58:58,897][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:58:59,401][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:58:59,902][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:59:00,413][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:59:00,918][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:59:01,423][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:59:01,933][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:59:02,437][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:59:02,940][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:59:03,440][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:59:03,945][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:59:04,447][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 02:59:04,951][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 02:59:05,470][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 02:59:05,974][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 02:59:06,475][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 02:59:06,979][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 02:59:07,485][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 02:59:07,992][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 02:59:08,493][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 02:59:08,996][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 02:59:09,507][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 02:59:10,011][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 02:59:10,511][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 02:59:11,014][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 02:59:11,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 02:59:12,019][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 02:59:12,519][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 02:59:13,022][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 02:59:13,525][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 02:59:14,028][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 02:59:14,537][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
02:59:15,038][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 02:59:15,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10290 tokens. [2025-11-13 02:59:16,319][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 02:59:17,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 02:59:17,093][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 02:59:17,094][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 02:59:17,956][__main__][INFO] - Iteration 327 took 56s (35.27% Gen, 63.19% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 57m 44s. Estimated total time: 46h 47m 21s. Time estimates for 10 more iterations: 9m 21s, 100 more iterations: 1h 33m 34s, 500 more iterations: 7h 47m 53s. [2025-11-13 02:59:17,959][__main__][INFO] - Starting iteration 327. [2025-11-13 02:59:18,436][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 02:59:18,437][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 02:59:37,086][__main__][INFO] - Number of regex retries in iteration 327: 0 [2025-11-13 02:59:37,087][__main__][INFO] - agents played in iteration 327 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 02:59:37,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:59:37,939][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:59:37,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:59:37,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 02:59:37,985][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 02:59:37,986][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 02:59:38,692][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 02:59:39,150][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 02:59:39,657][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 02:59:40,158][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 02:59:40,680][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 02:59:41,183][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 02:59:41,688][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 02:59:42,189][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 02:59:42,693][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 02:59:43,199][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 02:59:43,702][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 02:59:44,204][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 02:59:44,706][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 02:59:45,205][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 02:59:45,714][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 02:59:46,218][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 02:59:46,717][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 02:59:47,222][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 02:59:47,724][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 02:59:48,228][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 02:59:48,744][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
02:59:49,250][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 02:59:49,765][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 02:59:50,265][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 02:59:50,769][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 02:59:51,278][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 02:59:51,776][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 02:59:52,281][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 02:59:52,786][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 02:59:53,291][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 02:59:53,794][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 02:59:54,297][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 02:59:54,803][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 02:59:55,311][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 02:59:55,861][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 02:59:56,375][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 02:59:56,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 02:59:57,391][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 02:59:57,900][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 02:59:58,406][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 02:59:58,912][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 02:59:59,415][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
02:59:59,921][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:00:00,426][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:00:00,930][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:00:01,432][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:00:01,950][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:00:02,454][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:00:02,965][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:00:03,466][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:00:03,970][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:00:04,476][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:00:04,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:00:05,481][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:00:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:00:06,489][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:00:06,996][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:00:07,501][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:00:08,008][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:00:08,513][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:00:09,020][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:00:09,532][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:00:10,038][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:00:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:00:11,052][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10402 tokens. [2025-11-13 03:00:11,859][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.52%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 03:00:12,626][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:00:12,628][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:00:12,629][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:00:13,516][__main__][INFO] - Iteration 328 took 55s (33.86% Gen, 64.53% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 3m 29s. Estimated total time: 45h 54m 1s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 48s, 500 more iterations: 7h 39m 0s. [2025-11-13 03:00:13,519][__main__][INFO] - Starting iteration 328. [2025-11-13 03:00:14,002][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 03:00:14,003][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:00:20,791][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:00:33,737][__main__][INFO] - Number of regex retries in iteration 328: 1 [2025-11-13 03:00:33,737][__main__][INFO] - agents played in iteration 328 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:00:34,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:00:34,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:00:34,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:00:34,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:00:34,618][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:00:34,619][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:00:35,309][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:00:35,776][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:00:36,283][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:00:36,789][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:00:37,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:00:37,795][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:00:38,298][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:00:38,806][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:00:39,306][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:00:39,812][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:00:40,313][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:00:40,816][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:00:41,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:00:41,825][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:00:42,328][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:00:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:00:43,337][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:00:43,852][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:00:44,354][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:00:44,870][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:00:45,372][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:00:45,876][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:00:46,390][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:00:46,896][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:00:47,402][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:00:47,909][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:00:48,413][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:00:48,920][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:00:49,426][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:00:49,931][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:00:50,438][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:00:50,940][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:00:51,448][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:00:51,956][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:00:52,457][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:00:52,980][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:00:53,481][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:00:53,985][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:00:54,495][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:00:54,998][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:00:55,498][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:00:55,999][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:00:56,500][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:00:57,005][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:00:57,506][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:00:58,009][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:00:58,512][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:00:59,017][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:00:59,516][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:01:00,020][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:01:00,524][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:01:01,028][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:01:01,532][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:01:02,035][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:01:02,551][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:01:03,055][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:01:03,571][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:01:04,075][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:01:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:01:05,095][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:01:05,596][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:01:06,101][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:01:06,604][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:01:07,106][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:01:07,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10255 tokens. [2025-11-13 03:01:08,354][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.08%, ΔTime: 00:00:33 [2025-11-13 03:01:09,134][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:01:09,136][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:01:09,138][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:01:10,059][__main__][INFO] - Iteration 329 took 56s (35.20% Gen, 63.15% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 51m 22s. Estimated total time: 46h 42m 50s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 25s, 500 more iterations: 7h 47m 8s. [2025-11-13 03:01:10,061][__main__][INFO] - Starting iteration 329. [2025-11-13 03:01:10,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 03:01:10,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:01:27,625][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:01:27,740][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:01:28,478][__main__][INFO] - Number of regex retries in iteration 329: 2 [2025-11-13 03:01:28,478][__main__][INFO] - agents played in iteration 329 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:01:29,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:01:29,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:01:29,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:01:29,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:01:29,393][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:01:29,394][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:01:30,091][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:01:30,548][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:01:31,056][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:01:31,564][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:01:32,070][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:01:32,577][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:01:33,083][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:01:33,589][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:01:34,099][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:01:34,608][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:01:35,111][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:01:35,616][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:01:36,120][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:01:36,625][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:01:37,126][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:01:37,631][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:01:38,137][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:01:38,641][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:01:39,144][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:01:39,651][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:01:40,157][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:01:40,663][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:01:41,167][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:01:41,672][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:01:42,176][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:01:42,682][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:01:43,192][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:01:43,694][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:01:44,195][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:01:44,697][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:01:45,198][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:01:45,700][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:01:46,202][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:01:46,706][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:01:47,209][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:01:47,711][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:01:48,211][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:01:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:01:49,213][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:01:49,716][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:01:50,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:01:50,719][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:01:51,224][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:01:51,725][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:01:52,228][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:01:52,730][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:01:53,232][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:01:53,734][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:01:54,237][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:01:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:01:55,253][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:01:55,762][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:01:56,278][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:01:56,783][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:01:57,284][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:01:57,792][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:01:58,298][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:01:58,802][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:01:59,309][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:01:59,812][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:02:00,316][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:02:00,818][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:02:01,324][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:02:01,830][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:02:02,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10257 tokens. [2025-11-13 03:02:03,065][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:32 [2025-11-13 03:02:03,839][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:02:03,841][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:02:03,843][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:02:04,756][__main__][INFO] - Iteration 330 took 54s (33.08% Gen, 65.24% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 18m 6s. Estimated total time: 45h 10m 30s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 21s, 500 more iterations: 7h 31m 45s. [2025-11-13 03:02:04,758][__main__][INFO] - Starting iteration 330. [2025-11-13 03:02:05,229][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 32 and human policies 1. 
[2025-11-13 03:02:05,230][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:02:09,905][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:02:24,008][__main__][INFO] - Number of regex retries in iteration 330: 1 [2025-11-13 03:02:24,009][__main__][INFO] - agents played in iteration 330 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:02:24,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:02:24,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:02:24,942][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:02:24,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:02:24,965][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:02:24,966][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:02:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:02:26,143][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:02:26,650][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:02:27,152][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:02:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:02:28,154][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:02:28,655][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:02:29,180][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:02:29,681][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:02:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:02:30,685][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:02:31,188][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:02:31,694][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:02:32,199][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:02:32,706][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:02:33,209][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:02:33,714][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:02:34,222][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:02:34,726][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:02:35,228][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:02:35,733][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:02:36,238][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:02:36,742][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:02:37,256][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:02:37,761][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:02:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:02:38,785][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:02:39,287][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:02:39,799][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:02:40,302][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:02:40,803][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:02:41,304][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:02:41,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:02:42,317][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:02:42,819][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:02:43,320][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:02:43,821][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:02:44,321][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:02:44,823][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:02:45,329][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:02:45,833][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:02:46,335][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:02:46,835][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:02:47,337][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:02:47,837][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:02:48,342][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:02:48,851][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:02:49,354][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:02:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:02:50,372][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:02:50,879][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:02:51,391][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:02:51,896][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:02:52,402][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:02:52,909][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:02:53,415][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:02:53,920][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:02:54,431][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:02:54,937][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:02:55,443][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:02:55,947][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:02:56,449][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:02:56,952][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:02:57,476][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:02:57,995][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10208 tokens. [2025-11-13 03:02:58,709][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 03:02:59,494][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:02:59,496][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:02:59,498][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:03:01,300][__main__][INFO] - Iteration 331 took 56s (33.49% Gen, 63.29% Train). Generation: 18s, Training: 35s. Estimated remaining time: 41h 50m 16s. Estimated total time: 46h 43m 35s. Time estimates for 10 more iterations: 9m 20s, 100 more iterations: 1h 33m 27s, 500 more iterations: 7h 47m 15s. [2025-11-13 03:03:01,302][__main__][INFO] - Starting iteration 331. [2025-11-13 03:03:01,781][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. [2025-11-13 03:03:01,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:03:10,677][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given the per-item values, I will maximize my claim on hats since they have the highest value for me and Alice values them the least. 
This strategy aims to secure a high quantity of items with a high per-item value for myself, potentially maximizing my points. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:03:19,249][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:03:21,739][__main__][INFO] - Number of regex retries in iteration 331: 2 [2025-11-13 03:03:21,739][__main__][INFO] - agents played in iteration 331 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:03:22,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:03:22,598][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:03:22,623][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:03:22,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:03:22,646][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:03:22,646][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:03:23,326][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:03:23,786][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:03:24,295][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:03:24,795][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:03:25,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:03:25,820][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:03:26,325][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:03:26,829][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:03:27,333][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:03:27,837][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:03:28,344][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:03:28,846][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:03:29,356][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:03:29,865][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:03:30,372][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:03:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:03:31,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:03:31,897][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:03:32,404][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:03:32,908][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:03:33,414][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:03:33,914][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:03:34,415][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:03:34,919][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:03:35,421][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:03:35,929][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:03:36,431][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:03:36,935][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:03:37,439][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:03:37,939][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:03:38,438][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:03:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:03:39,439][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:03:39,948][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:03:40,454][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:03:40,959][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:03:41,469][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:03:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:03:42,485][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:03:42,990][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:03:43,494][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:03:44,003][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:03:44,506][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:03:45,013][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:03:45,517][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:03:46,024][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:03:46,533][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:03:47,040][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:03:47,544][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:03:48,051][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:03:48,558][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:03:49,067][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:03:49,572][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:03:50,073][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:03:50,575][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:03:51,080][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:03:51,589][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:03:52,090][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:03:52,593][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:03:53,095][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:03:53,594][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:03:54,097][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:03:54,596][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:03:55,096][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:03:55,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10377 tokens. [2025-11-13 03:03:56,457][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 03:03:57,234][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:03:57,235][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:03:57,237][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:03:58,146][__main__][INFO] - Iteration 332 took 56s (35.41% Gen, 62.98% Train). Generation: 19s, Training: 35s. Estimated remaining time: 42h 3m 58s. Estimated total time: 46h 58m 14s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 56s, 500 more iterations: 7h 49m 42s. [2025-11-13 03:03:58,148][__main__][INFO] - Starting iteration 332. [2025-11-13 03:03:58,616][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 03:03:58,617][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:04:16,270][__main__][INFO] - Number of regex retries in iteration 332: 0 [2025-11-13 03:04:16,271][__main__][INFO] - agents played in iteration 332 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:04:17,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:04:17,094][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:04:17,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:04:17,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:04:17,141][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:04:17,142][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:04:17,823][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:04:18,302][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:04:18,809][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:04:19,313][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:04:19,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:04:20,328][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:04:20,840][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:04:21,346][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:04:21,851][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:04:22,363][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:04:22,873][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:04:23,380][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:04:23,886][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:04:24,393][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:04:24,914][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:04:25,424][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:04:25,933][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:04:26,434][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:04:26,937][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:04:27,438][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:04:27,941][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:04:28,442][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:04:28,944][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:04:29,444][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:04:29,951][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:04:30,452][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:04:30,956][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:04:31,457][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:04:31,962][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:04:32,463][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:04:32,963][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:04:33,465][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:04:33,971][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:04:34,478][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:04:34,982][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:04:35,514][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:04:36,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:04:36,541][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:04:37,048][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:04:37,556][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:04:38,062][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:04:38,571][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:04:39,079][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:04:39,582][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:04:40,086][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:04:40,595][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:04:41,100][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:04:41,610][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:04:42,121][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:04:42,628][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:04:43,144][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:04:43,648][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:04:44,150][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:04:44,652][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:04:45,154][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:04:45,662][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:04:46,166][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:04:46,670][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:04:47,174][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:04:47,677][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:04:48,180][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:04:48,682][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:04:49,185][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:04:49,688][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:04:50,189][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10302 tokens. [2025-11-13 03:04:50,902][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.50%, ΔTime: 00:00:33 [2025-11-13 03:04:51,674][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:04:51,677][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:04:51,679][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:04:52,611][__main__][INFO] - Iteration 333 took 53s (32.69% Gen, 65.58% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 4m 37s. Estimated total time: 44h 59m 48s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 59s, 500 more iterations: 7h 29m 58s. [2025-11-13 03:04:52,613][__main__][INFO] - Starting iteration 333. [2025-11-13 03:04:53,096][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 03:04:53,096][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:05:12,647][__main__][INFO] - Number of regex retries in iteration 333: 0 [2025-11-13 03:05:12,648][__main__][INFO] - agents played in iteration 333 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:05:13,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:05:13,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:05:13,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:05:13,523][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:05:13,523][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:05:13,525][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:05:14,271][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:05:14,734][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:05:15,247][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:05:15,762][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:05:16,269][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:05:16,777][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:05:17,286][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:05:17,796][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:05:18,305][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:05:18,813][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:05:19,316][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:05:19,820][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:05:20,323][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:05:20,824][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:05:21,329][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:05:21,832][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:05:22,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:05:22,857][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:05:23,357][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:05:23,857][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:05:24,359][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:05:24,861][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:05:25,361][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:05:25,866][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:05:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:05:26,872][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:05:27,376][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:05:27,876][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:05:28,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:05:28,881][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:05:29,383][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:05:29,885][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:05:30,394][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:05:30,898][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:05:31,403][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:05:31,917][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:05:32,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:05:32,933][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:05:33,438][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:05:33,942][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:05:34,444][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:05:34,951][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:05:35,463][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:05:35,968][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:05:36,474][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:05:36,982][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:05:37,489][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:05:37,997][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:05:38,503][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:05:39,005][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:05:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:05:40,026][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:05:40,532][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:05:41,038][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:05:41,540][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:05:42,047][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:05:42,550][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:05:43,053][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:05:43,564][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:05:44,069][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:05:44,571][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:05:45,076][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:05:45,581][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:05:46,091][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:05:46,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10387 tokens. [2025-11-13 03:05:47,307][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 03:05:48,072][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:05:48,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:05:48,075][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:05:49,021][__main__][INFO] - Iteration 334 took 55s (34.96% Gen, 63.35% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 40m 9s. Estimated total time: 46h 36m 16s. Time estimates for 10 more iterations: 9m 19s, 100 more iterations: 1h 33m 12s, 500 more iterations: 7h 46m 2s. [2025-11-13 03:05:49,023][__main__][INFO] - Starting iteration 334. [2025-11-13 03:05:49,503][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 03:05:49,503][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:06:08,102][__main__][INFO] - Number of regex retries in iteration 334: 0 [2025-11-13 03:06:08,103][__main__][INFO] - agents played in iteration 334 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:06:08,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:06:08,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:06:08,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:06:08,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:06:08,967][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:06:08,968][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:06:09,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:06:10,124][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:06:10,631][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:06:11,153][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:06:11,658][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:06:12,162][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:06:12,663][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:06:13,170][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:06:13,678][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:06:14,188][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:06:14,691][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:06:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:06:15,704][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:06:16,211][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:06:16,714][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:06:17,218][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:06:17,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:06:18,226][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:06:18,731][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:06:19,236][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:06:19,737][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:06:20,248][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:06:20,751][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:06:21,251][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:06:21,756][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:06:22,258][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:06:22,762][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:06:23,265][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:06:23,766][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:06:24,274][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:06:24,776][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:06:25,277][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:06:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:06:26,286][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:06:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:06:27,295][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:06:27,800][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:06:28,308][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:06:28,809][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:06:29,312][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:06:29,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:06:30,329][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:06:30,834][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:06:31,336][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:06:31,838][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:06:32,356][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:06:32,870][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:06:33,380][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:06:33,884][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:06:34,388][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:06:34,891][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:06:35,394][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:06:35,895][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:06:36,404][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:06:36,907][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:06:37,412][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:06:37,915][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:06:38,416][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:06:38,918][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:06:39,418][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:06:39,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:06:40,424][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:06:40,926][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:06:41,439][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:06:41,942][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10457 tokens. [2025-11-13 03:06:42,670][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 03:06:43,455][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:06:43,457][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:06:43,458][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:06:44,379][__main__][INFO] - Iteration 335 took 54s (33.89% Gen, 64.43% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 46m 47s. Estimated total time: 45h 43m 50s. Time estimates for 10 more iterations: 9m 8s, 100 more iterations: 1h 31m 27s, 500 more iterations: 7h 37m 18s. [2025-11-13 03:06:44,381][__main__][INFO] - Starting iteration 335. [2025-11-13 03:06:44,850][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 03:06:44,851][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:07:04,232][__main__][INFO] - Number of regex retries in iteration 335: 0 [2025-11-13 03:07:04,233][__main__][INFO] - agents played in iteration 335 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:07:05,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:07:05,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:07:05,135][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:07:05,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:07:05,160][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:07:05,161][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:07:05,858][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:07:06,314][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:07:06,820][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:07:07,329][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:07:07,838][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:07:08,342][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:07:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:07:09,352][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:07:09,853][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:07:10,357][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:07:10,861][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:07:11,362][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:07:11,883][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:07:12,386][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:07:12,891][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:07:13,395][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:07:13,896][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:07:14,404][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:07:14,928][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:07:15,430][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:07:15,954][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:07:16,457][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:07:16,968][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:07:17,470][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:07:17,975][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:07:18,486][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:07:18,987][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:07:19,492][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:07:19,996][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:07:20,504][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:07:21,006][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:07:21,517][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:07:22,024][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:07:22,528][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:07:23,032][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:07:23,534][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:07:24,036][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:07:24,538][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:07:25,059][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:07:25,563][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:07:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:07:26,567][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:07:27,069][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:07:27,581][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:07:28,082][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:07:28,586][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:07:29,087][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:07:29,590][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:07:30,092][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:07:30,597][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:07:31,097][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:07:31,605][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:07:32,109][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:07:32,610][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:07:33,113][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:07:33,616][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:07:34,121][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:07:34,628][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:07:35,132][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:07:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:07:36,153][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:07:36,657][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:07:37,160][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:07:37,665][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:07:38,182][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10329 tokens. [2025-11-13 03:07:38,891][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.42%, ΔTime: 00:00:33 [2025-11-13 03:07:39,684][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:07:39,685][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:07:39,687][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:07:40,612][__main__][INFO] - Iteration 336 took 55s (34.76% Gen, 63.58% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 30m 9s. Estimated total time: 46h 28m 8s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 56s, 500 more iterations: 7h 44m 41s. [2025-11-13 03:07:40,614][__main__][INFO] - Starting iteration 336. [2025-11-13 03:07:41,081][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 03:07:41,082][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:07:57,297][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:07:58,779][__main__][INFO] - Number of regex retries in iteration 336: 1 [2025-11-13 03:07:58,780][__main__][INFO] - agents played in iteration 336 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:07:59,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:07:59,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:07:59,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:07:59,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:07:59,693][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:07:59,694][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:08:00,403][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:08:00,863][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:08:01,374][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:08:01,888][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:08:02,391][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:08:02,895][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:08:03,399][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:08:03,912][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:08:04,416][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:08:04,919][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:08:05,424][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:08:05,931][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:08:06,434][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:08:06,936][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:08:07,442][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:08:07,942][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:08:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:08:08,964][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:08:09,464][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:08:09,976][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:08:10,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:08:10,980][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:08:11,488][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:08:11,991][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:08:12,493][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:08:12,993][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:08:13,493][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:08:13,997][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:08:14,495][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:08:15,000][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:08:15,508][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:08:16,011][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:08:16,513][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:08:17,014][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:08:17,518][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:08:18,030][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:08:18,535][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:08:19,040][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:08:19,545][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:08:20,051][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:08:20,578][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:08:21,083][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:08:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:08:22,087][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:08:22,592][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:08:23,096][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:08:23,598][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:08:24,098][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:08:24,602][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:08:25,106][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:08:25,611][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:08:26,113][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:08:26,615][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:08:27,122][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:08:27,627][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:08:28,131][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:08:28,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:08:29,138][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:08:29,638][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:08:30,142][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:08:30,647][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:08:31,153][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:08:31,657][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:08:32,160][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:08:32,672][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10380 tokens. [2025-11-13 03:08:33,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:33 [2025-11-13 03:08:34,190][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:08:34,192][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:08:34,194][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:08:35,148][__main__][INFO] - Iteration 337 took 54s (32.73% Gen, 65.50% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 4m 28s. Estimated total time: 45h 3m 22s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 6s, 500 more iterations: 7h 30m 33s. [2025-11-13 03:08:35,151][__main__][INFO] - Starting iteration 337. [2025-11-13 03:08:35,649][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 03:08:35,650][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:08:54,290][__main__][INFO] - Number of regex retries in iteration 337: 0 [2025-11-13 03:08:54,291][__main__][INFO] - agents played in iteration 337 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:08:55,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:08:55,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:08:55,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:08:55,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:08:55,198][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:08:55,199][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:08:56,063][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:08:56,528][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:08:57,034][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:08:57,541][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:08:58,042][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:08:58,550][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:08:59,056][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:08:59,561][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:09:00,064][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:09:00,566][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:09:01,073][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:09:01,580][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:09:02,084][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:09:02,588][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:09:03,090][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:09:03,594][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:09:04,096][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:09:04,600][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:09:05,105][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:09:05,609][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:09:06,110][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:09:06,614][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:09:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:09:07,624][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:09:08,131][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:09:08,637][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:09:09,145][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:09:09,651][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:09:10,161][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:09:10,674][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:09:11,179][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:09:11,694][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:09:12,199][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:09:12,701][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:09:13,210][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:09:13,715][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:09:14,217][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:09:14,722][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:09:15,222][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:09:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:09:16,229][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:09:16,732][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:09:17,233][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:09:17,732][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:09:18,235][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:09:18,736][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:09:19,236][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:09:19,740][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:09:20,241][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:09:20,740][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:09:21,251][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:09:21,752][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:09:22,263][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:09:22,771][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:09:23,273][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:09:23,778][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:09:24,283][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:09:24,789][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:09:25,290][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:09:25,791][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:09:26,294][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:09:26,798][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:09:27,301][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:09:27,803][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:09:28,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10333 tokens. [2025-11-13 03:09:29,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 03:09:29,852][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:09:29,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:09:29,855][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:09:30,835][__main__][INFO] - Iteration 338 took 55s (33.78% Gen, 64.44% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 59m 30s. Estimated total time: 45h 59m 20s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 58s, 500 more iterations: 7h 39m 53s. [2025-11-13 03:09:30,837][__main__][INFO] - Starting iteration 338. [2025-11-13 03:09:31,325][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 03:09:31,326][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:09:48,485][__main__][INFO] - Number of regex retries in iteration 338: 0 [2025-11-13 03:09:48,486][__main__][INFO] - agents played in iteration 338 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:09:49,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:09:49,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:09:49,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:09:49,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:09:49,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:09:49,432][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:09:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:09:50,619][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:09:51,132][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:09:51,639][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:09:52,148][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:09:52,654][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:09:53,158][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:09:53,673][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:09:54,176][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:09:54,694][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:09:55,197][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:09:55,701][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:09:56,210][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:09:56,714][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:09:57,217][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:09:57,721][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:09:58,228][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:09:58,730][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:09:59,235][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:09:59,736][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:10:00,238][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:10:00,742][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:10:01,245][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:10:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:10:02,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:10:02,769][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:10:03,275][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:10:03,794][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:10:04,297][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:10:04,799][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:10:05,303][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:10:05,811][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:10:06,313][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:10:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:10:07,326][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:10:07,833][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:10:08,339][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:10:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:10:09,351][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:10:09,855][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:10:10,360][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:10:10,863][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:10:11,368][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:10:11,884][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:10:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:10:12,901][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:10:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:10:13,911][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:10:14,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:10:14,920][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:10:15,418][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:10:15,920][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:10:16,421][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:10:16,925][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:10:17,426][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:10:17,928][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:10:18,431][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:10:18,932][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:10:19,435][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:10:19,935][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:10:20,438][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:10:20,943][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:10:21,447][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:10:21,948][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:10:22,456][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10389 tokens. [2025-11-13 03:10:23,242][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.48%, ΔTime: 00:00:33 [2025-11-13 03:10:24,030][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:10:24,031][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:10:24,033][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:10:24,957][__main__][INFO] - Iteration 339 took 53s (32.00% Gen, 66.28% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 40m 52s. Estimated total time: 44h 41m 35s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 23s, 500 more iterations: 7h 26m 55s. [2025-11-13 03:10:24,959][__main__][INFO] - Starting iteration 339. [2025-11-13 03:10:25,444][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 03:10:25,445][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:10:34,307][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:10:46,389][__main__][INFO] - Number of regex retries in iteration 339: 1 [2025-11-13 03:10:46,390][__main__][INFO] - agents played in iteration 339 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:10:47,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:10:47,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:10:47,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:10:47,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:10:47,345][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:10:47,346][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:10:48,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:10:48,499][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:10:49,009][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:10:49,510][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:10:50,015][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:10:50,518][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:10:51,024][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:10:51,529][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:10:52,030][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:10:52,537][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:10:53,039][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:10:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:10:54,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:10:54,553][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:10:55,055][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:10:55,563][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:10:56,070][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:10:56,582][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:10:57,111][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:10:57,617][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:10:58,118][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:10:58,620][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:10:59,126][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:10:59,631][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:11:00,133][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:11:00,642][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:11:01,144][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:11:01,650][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:11:02,157][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:11:02,662][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:11:03,167][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:11:03,672][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:11:04,174][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:11:04,683][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:11:05,189][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:11:05,698][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:11:06,199][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:11:06,703][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:11:07,209][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:11:07,711][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:11:08,214][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:11:08,725][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:11:09,231][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:11:09,735][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:11:10,238][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:11:10,743][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:11:11,249][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:11:11,783][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:11:12,291][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:11:12,790][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:11:13,289][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:11:13,792][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:11:14,295][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:11:14,800][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:11:15,299][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:11:15,802][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:11:16,330][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:11:16,836][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:11:17,346][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:11:17,855][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:11:18,356][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:11:18,860][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:11:19,364][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:11:19,868][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:11:20,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10284 tokens. [2025-11-13 03:11:21,171][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33 [2025-11-13 03:11:21,966][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:11:21,968][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:11:21,969][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:11:22,886][__main__][INFO] - Iteration 340 took 57s (36.46% Gen, 61.94% Train). Generation: 20s, Training: 35s. Estimated remaining time: 42h 50m 26s. Estimated total time: 47h 52m 8s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 44s, 500 more iterations: 7h 58m 41s. [2025-11-13 03:11:22,888][__main__][INFO] - Starting iteration 340. [2025-11-13 03:11:23,372][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 33 and human policies 1. 
[2025-11-13 03:11:23,373][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:11:42,686][__main__][INFO] - Number of regex retries in iteration 340: 0 [2025-11-13 03:11:42,687][__main__][INFO] - agents played in iteration 340 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:11:43,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:11:43,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:11:43,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:11:43,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:11:43,639][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:11:43,640][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:11:44,319][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:11:44,781][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:11:45,285][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:11:45,790][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:11:46,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:11:46,808][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:11:47,314][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:11:47,835][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:11:48,338][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:11:48,842][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:11:49,345][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:11:49,850][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:11:50,361][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:11:50,865][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:11:51,372][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:11:51,875][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:11:52,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:11:52,891][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:11:53,395][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:11:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:11:54,408][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:11:54,914][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:11:55,417][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:11:55,922][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:11:56,425][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:11:56,931][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:11:57,433][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:11:57,934][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:11:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:11:58,939][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:11:59,442][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:11:59,947][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:12:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:12:00,950][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:12:01,452][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:12:01,956][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:12:02,460][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:12:02,963][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:12:03,468][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:12:03,971][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:12:04,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:12:04,976][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:12:05,480][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:12:05,987][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:12:06,491][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:12:06,996][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:12:07,500][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:12:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:12:08,523][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:12:09,029][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:12:09,532][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:12:10,038][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:12:10,546][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:12:11,050][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:12:11,554][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:12:12,060][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:12:12,564][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:12:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:12:13,567][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:12:14,068][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:12:14,571][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:12:15,076][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:12:15,581][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:12:16,083][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:12:16,595][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10311 tokens. [2025-11-13 03:12:17,358][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 03:12:18,123][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:12:18,124][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:12:18,127][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:12:19,826][__main__][INFO] - Iteration 341 took 56s (34.21% Gen, 62.78% Train). Generation: 19s, Training: 35s. Estimated remaining time: 42h 0m 6s. Estimated total time: 47h 2m 44s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 5s, 500 more iterations: 7h 50m 27s. [2025-11-13 03:12:19,828][__main__][INFO] - Starting iteration 341. [2025-11-13 03:12:20,305][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. 
[2025-11-13 03:12:20,306][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:12:38,970][__main__][INFO] - Number of regex retries in iteration 341: 0 [2025-11-13 03:12:38,971][__main__][INFO] - agents played in iteration 341 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:12:39,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:12:39,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:12:39,890][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:12:39,914][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:12:39,915][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:12:39,916][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:12:40,591][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:12:41,053][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:12:41,556][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:12:42,058][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:12:42,559][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:12:43,063][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:12:43,571][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:12:44,080][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:12:44,585][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:12:45,090][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:12:45,594][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:12:46,096][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:12:46,615][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:12:47,118][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:12:47,620][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:12:48,125][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:12:48,631][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:12:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:12:49,644][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:12:50,147][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:12:50,656][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:12:51,161][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:12:51,662][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:12:52,163][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:12:52,665][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:12:53,171][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:12:53,671][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:12:54,172][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:12:54,672][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:12:55,174][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:12:55,676][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:12:56,176][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:12:56,675][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:12:57,179][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:12:57,682][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:12:58,187][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:12:58,693][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:12:59,202][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:12:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:13:00,213][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:13:00,716][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:13:01,221][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:13:01,727][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:13:02,244][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:13:02,746][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:13:03,251][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:13:03,753][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:13:04,255][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:13:04,760][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:13:05,265][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:13:05,771][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:13:06,280][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:13:06,783][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:13:07,322][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:13:07,830][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:13:08,336][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:13:08,841][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:13:09,342][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:13:09,848][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:13:10,352][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:13:10,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:13:11,361][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:13:11,864][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:13:12,366][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:13:12,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10222 tokens. [2025-11-13 03:13:13,621][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 03:13:14,390][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:13:14,391][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:13:14,393][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:13:15,314][__main__][INFO] - Iteration 342 took 55s (33.93% Gen, 64.39% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 46m 53s. Estimated total time: 45h 50m 26s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 24s. [2025-11-13 03:13:15,316][__main__][INFO] - Starting iteration 342. [2025-11-13 03:13:15,782][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. 
[2025-11-13 03:13:15,782][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:13:19,725][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:13:32,906][__main__][INFO] - Number of regex retries in iteration 342: 1 [2025-11-13 03:13:32,907][__main__][INFO] - agents played in iteration 342 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:13:33,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:13:33,785][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:13:33,811][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:13:33,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:13:33,834][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:13:33,835][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:13:34,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:13:34,976][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:13:35,483][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:13:36,023][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:13:36,527][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:13:37,029][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:13:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:13:38,036][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:13:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:13:39,047][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:13:39,550][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:13:40,059][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:13:40,560][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:13:41,065][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:13:41,571][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:13:42,077][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:13:42,581][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:13:43,084][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:13:43,587][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:13:44,093][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:13:44,601][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:13:45,137][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:13:45,637][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:13:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:13:46,644][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:13:47,151][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:13:47,658][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:13:48,161][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:13:48,670][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:13:49,172][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:13:49,677][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:13:50,183][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:13:50,686][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:13:51,191][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:13:51,700][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:13:52,204][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:13:52,710][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:13:53,213][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:13:53,715][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:13:54,233][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:13:54,736][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:13:55,276][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:13:55,783][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:13:56,291][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:13:56,816][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:13:57,319][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:13:57,824][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:13:58,350][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:13:58,856][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:13:59,362][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:13:59,859][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:14:00,361][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:14:00,866][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:14:01,375][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:14:01,880][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:14:02,384][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:14:02,889][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:14:03,403][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:14:03,909][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:14:04,422][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:14:04,928][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:14:05,437][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:14:05,942][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:14:06,454][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:14:06,961][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10325 tokens. [2025-11-13 03:14:07,761][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.43%, ΔTime: 00:00:33 [2025-11-13 03:14:08,514][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:14:08,516][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:14:08,517][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:14:09,471][__main__][INFO] - Iteration 343 took 53s (31.89% Gen, 66.33% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 40m 3s. Estimated total time: 44h 44m 31s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 29s, 500 more iterations: 7h 27m 25s. [2025-11-13 03:14:09,473][__main__][INFO] - Starting iteration 343. [2025-11-13 03:14:09,970][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. 
[2025-11-13 03:14:09,971][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:14:18,335][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:14:22,772][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:14:29,749][__main__][INFO] - Number of regex retries in iteration 343: 2 [2025-11-13 03:14:29,749][__main__][INFO] - agents played in iteration 343 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:14:30,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:14:30,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:14:30,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:14:30,637][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:14:30,638][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:14:30,639][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:14:31,366][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:14:31,828][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:14:32,341][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:14:32,852][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:14:33,354][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:14:33,872][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:14:34,380][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:14:34,888][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:14:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:14:35,899][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:14:36,406][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:14:36,911][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:14:37,415][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:14:37,918][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:14:38,424][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:14:38,925][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:14:39,427][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:14:39,932][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:14:40,461][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:14:40,966][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:14:41,470][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:14:41,979][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:14:42,482][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:14:42,990][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:14:43,496][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:14:43,998][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:14:44,502][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:14:45,004][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:14:45,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:14:46,012][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:14:46,516][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:14:47,027][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:14:47,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:14:48,032][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:14:48,542][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:14:49,047][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:14:49,552][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:14:50,057][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:14:50,561][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:14:51,064][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:14:51,569][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:14:52,076][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:14:52,583][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:14:53,088][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:14:53,595][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:14:54,100][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:14:54,601][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:14:55,108][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:14:55,611][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:14:56,117][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:14:56,625][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:14:57,131][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:14:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:14:58,144][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:14:58,658][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:14:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:14:59,674][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:15:00,182][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:15:00,689][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:15:01,200][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:15:01,709][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:15:02,215][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:15:02,719][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:15:03,227][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:15:03,730][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10325 tokens. [2025-11-13 03:15:04,476][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 03:15:05,261][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:15:05,262][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:15:05,264][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:15:06,213][__main__][INFO] - Iteration 344 took 56s (35.17% Gen, 63.14% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 46m 46s. Estimated total time: 46h 52m 10s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 44s, 500 more iterations: 7h 48m 41s. [2025-11-13 03:15:06,215][__main__][INFO] - Starting iteration 344. [2025-11-13 03:15:06,690][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. 
[2025-11-13 03:15:06,691][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:15:24,469][__main__][INFO] - Number of regex retries in iteration 344: 0 [2025-11-13 03:15:24,470][__main__][INFO] - agents played in iteration 344 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:15:25,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:15:25,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:15:25,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:15:25,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:15:25,346][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:15:25,347][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:15:26,096][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:15:26,556][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:15:27,069][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:15:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:15:28,089][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:15:28,596][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:15:29,108][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:15:29,611][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:15:30,114][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:15:30,620][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:15:31,130][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:15:31,647][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:15:32,151][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:15:32,653][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:15:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:15:33,660][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:15:34,168][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:15:34,673][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:15:35,200][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:15:35,902][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:15:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:15:36,927][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:15:37,434][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:15:37,940][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:15:38,457][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:15:38,960][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:15:39,469][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:15:39,974][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:15:40,477][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:15:40,986][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:15:41,490][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:15:41,995][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:15:42,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:15:43,001][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:15:43,509][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:15:44,011][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:15:44,516][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:15:45,020][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:15:45,523][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:15:46,026][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:15:46,528][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:15:47,029][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:15:47,533][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:15:48,036][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:15:48,538][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:15:49,037][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:15:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:15:50,059][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:15:50,569][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:15:51,083][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:15:51,590][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:15:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:15:52,604][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:15:53,111][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:15:53,615][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:15:54,122][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:15:54,627][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:15:55,131][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:15:55,632][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:15:56,133][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:15:56,640][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:15:57,141][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:15:57,641][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:15:58,148][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:15:58,654][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10373 tokens. [2025-11-13 03:15:59,397][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.48%, ΔTime: 00:00:33 [2025-11-13 03:16:00,145][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:16:00,147][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:16:00,149][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:16:01,092][__main__][INFO] - Iteration 345 took 54s (32.68% Gen, 65.58% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 13m 48s. Estimated total time: 45h 20m 7s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 40s, 500 more iterations: 7h 33m 21s. [2025-11-13 03:16:01,095][__main__][INFO] - Starting iteration 345. [2025-11-13 03:16:01,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. 
[2025-11-13 03:16:01,577][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:16:19,088][__main__][INFO] - Number of regex retries in iteration 345: 0 [2025-11-13 03:16:19,088][__main__][INFO] - agents played in iteration 345 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:16:19,955][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:16:19,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:16:20,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:16:20,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:16:20,029][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:16:20,030][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:16:20,756][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:16:21,219][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:16:21,732][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:16:22,242][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:16:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:16:23,259][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:16:23,766][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:16:24,275][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:16:24,782][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:16:25,298][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:16:25,805][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:16:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:16:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:16:27,320][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:16:27,824][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:16:28,331][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:16:28,839][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:16:29,353][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:16:29,862][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:16:30,378][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:16:30,883][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:16:31,393][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:16:31,902][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:16:32,406][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:16:32,911][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:16:33,415][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:16:33,921][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:16:34,427][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:16:34,933][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:16:35,440][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:16:35,945][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:16:36,453][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:16:36,985][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:16:37,492][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:16:38,001][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:16:38,506][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:16:39,009][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:16:39,516][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:16:40,020][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:16:40,524][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:16:41,030][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:16:41,540][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:16:42,046][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:16:42,552][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:16:43,060][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:16:43,582][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:16:44,091][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:16:44,600][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:16:45,109][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:16:45,613][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:16:46,119][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:16:46,621][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:16:47,125][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:16:47,625][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:16:48,125][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:16:48,626][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:16:49,130][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:16:49,632][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:16:50,136][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:16:50,639][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:16:51,143][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:16:51,646][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:16:52,148][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:16:52,653][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:16:53,156][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10366 tokens. [2025-11-13 03:16:53,875][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 03:16:54,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:16:54,628][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:16:54,630][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:16:55,594][__main__][INFO] - Iteration 346 took 54s (32.42% Gen, 65.80% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 53m 41s. Estimated total time: 45h 0m 55s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 1s, 500 more iterations: 7h 30m 9s. [2025-11-13 03:16:55,596][__main__][INFO] - Starting iteration 346. [2025-11-13 03:16:56,164][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. 
[2025-11-13 03:16:56,166][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:17:15,222][__main__][INFO] - Number of regex retries in iteration 346: 0 [2025-11-13 03:17:15,223][__main__][INFO] - agents played in iteration 346 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:17:16,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:17:16,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:17:16,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:17:16,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:17:16,146][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:17:16,146][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:17:16,894][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:17:17,357][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:17:17,880][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:17:18,387][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:17:18,896][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:17:19,409][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:17:19,912][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:17:20,422][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:17:20,929][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:17:21,434][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:17:21,951][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:17:22,455][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:17:22,961][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:17:23,480][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:17:23,984][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:17:24,489][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:17:24,991][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:17:25,493][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:17:25,994][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:17:26,498][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:17:27,005][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:17:27,507][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:17:28,007][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:17:28,514][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:17:29,015][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:17:29,517][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:17:30,019][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:17:30,520][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:17:31,020][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:17:31,521][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:17:32,023][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:17:32,529][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:17:33,032][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:17:33,536][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:17:34,043][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:17:34,553][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:17:35,057][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:17:35,565][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:17:36,068][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:17:36,585][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:17:37,092][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:17:37,603][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:17:38,106][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:17:38,612][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:17:39,120][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:17:39,626][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:17:40,132][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:17:40,634][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:17:41,138][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:17:41,642][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:17:42,145][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:17:42,653][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:17:43,155][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:17:43,661][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:17:44,166][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:17:44,687][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:17:45,191][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:17:45,710][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:17:46,215][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:17:46,717][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:17:47,219][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:17:47,720][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:17:48,224][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:17:48,730][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:17:49,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10375 tokens. [2025-11-13 03:17:49,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 03:17:50,741][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:17:50,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:17:50,745][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:17:51,810][__main__][INFO] - Iteration 347 took 55s (34.25% Gen, 63.83% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 14m 10s. Estimated total time: 46h 22m 20s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 44s, 500 more iterations: 7h 43m 43s. [2025-11-13 03:17:51,812][__main__][INFO] - Starting iteration 347. [2025-11-13 03:17:52,300][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. 
[2025-11-13 03:17:52,300][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:18:11,584][__main__][INFO] - Number of regex retries in iteration 347: 0 [2025-11-13 03:18:11,584][__main__][INFO] - agents played in iteration 347 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:18:12,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:18:12,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:18:12,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:18:12,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:18:12,472][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:18:12,473][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:18:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:18:13,715][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:18:14,222][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:18:14,728][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:18:15,235][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:18:15,744][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:18:16,247][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:18:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:18:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:18:17,765][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:18:18,272][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:18:18,777][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:18:19,281][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:18:19,803][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:18:20,309][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:18:20,812][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:18:21,317][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:18:21,820][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:18:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:18:22,835][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:18:23,340][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:18:23,844][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:18:24,349][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:18:24,854][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:18:25,360][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:18:25,865][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:18:26,377][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:18:26,892][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:18:27,400][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:18:27,909][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:18:28,417][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:18:28,932][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:18:29,439][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:18:29,945][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:18:30,452][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:18:30,957][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:18:31,469][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:18:31,975][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:18:32,477][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:18:32,983][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:18:33,489][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:18:33,999][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:18:34,500][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:18:35,007][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:18:35,528][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:18:36,031][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:18:36,539][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:18:37,044][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:18:37,550][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:18:38,058][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:18:38,586][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:18:39,092][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:18:39,608][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:18:40,110][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:18:40,625][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:18:41,129][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:18:41,632][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:18:42,146][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:18:42,654][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:18:43,166][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:18:43,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:18:44,183][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:18:44,706][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:18:45,212][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:18:45,717][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10410 tokens. [2025-11-13 03:18:46,421][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 03:18:47,189][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:18:47,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:18:47,193][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:18:48,149][__main__][INFO] - Iteration 348 took 55s (34.53% Gen, 63.76% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 23m 23s. Estimated total time: 46h 32m 29s. Time estimates for 10 more iterations: 9m 18s, 100 more iterations: 1h 33m 4s, 500 more iterations: 7h 45m 24s. [2025-11-13 03:18:48,151][__main__][INFO] - Starting iteration 348. [2025-11-13 03:18:48,640][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. 
[2025-11-13 03:18:48,641][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:18:53,543][mllm.models.large_language_model_local][WARNING] - Response Proposal: 5 hats, 5 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:19:06,026][__main__][INFO] - Number of regex retries in iteration 348: 1 [2025-11-13 03:19:06,026][__main__][INFO] - agents played in iteration 348 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:19:06,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:19:06,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:19:06,895][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:19:06,918][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:19:06,919][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:19:06,920][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:19:07,673][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:19:08,141][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:19:08,652][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:19:09,160][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:19:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:19:10,182][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:19:10,685][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:19:11,201][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:19:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:19:12,219][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:19:12,728][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:19:13,236][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:19:13,747][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:19:14,252][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:19:14,758][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:19:15,266][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:19:15,772][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:19:16,278][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:19:16,780][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:19:17,284][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:19:17,800][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:19:18,308][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:19:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:19:19,329][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:19:19,832][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:19:20,342][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:19:20,846][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:19:21,352][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:19:21,861][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:19:22,368][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:19:22,871][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:19:23,378][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:19:23,883][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:19:24,399][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:19:24,903][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:19:25,416][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:19:25,919][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:19:26,424][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:19:26,926][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:19:27,429][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:19:27,930][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:19:28,433][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:19:28,934][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:19:29,441][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:19:29,943][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:19:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:19:30,952][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:19:31,459][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:19:31,963][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:19:32,467][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:19:32,969][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:19:33,472][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:19:33,979][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:19:34,481][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:19:34,982][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:19:35,485][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:19:36,005][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:19:36,512][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:19:37,021][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:19:37,521][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:19:38,022][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:19:38,536][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:19:39,039][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:19:39,542][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:19:40,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10357 tokens. [2025-11-13 03:19:40,810][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.48%, ΔTime: 00:00:33 [2025-11-13 03:19:41,583][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:19:41,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:19:41,586][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:19:42,572][__main__][INFO] - Iteration 349 took 53s (32.23% Gen, 65.93% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 46m 35s. Estimated total time: 44h 56m 36s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 53s, 500 more iterations: 7h 29m 26s. [2025-11-13 03:19:42,574][__main__][INFO] - Starting iteration 349. [2025-11-13 03:19:43,060][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. 
[2025-11-13 03:19:43,060][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:20:00,835][__main__][INFO] - Number of regex retries in iteration 349: 0 [2025-11-13 03:20:00,836][__main__][INFO] - agents played in iteration 349 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:20:01,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:20:01,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:20:01,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:20:01,711][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:20:01,712][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:20:01,712][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:20:02,469][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:20:02,928][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:20:03,443][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:20:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:20:04,452][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:20:04,956][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:20:05,459][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:20:05,967][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:20:06,474][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:20:06,977][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:20:07,480][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:20:07,985][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:20:08,487][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:20:08,990][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:20:09,496][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:20:10,005][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:20:10,506][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:20:11,016][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:20:11,522][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:20:12,026][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:20:12,532][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:20:13,036][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:20:13,544][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:20:14,050][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:20:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:20:15,065][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:20:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:20:16,080][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:20:16,586][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:20:17,091][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:20:17,599][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:20:18,126][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:20:18,632][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:20:19,157][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:20:19,663][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:20:20,172][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:20:20,679][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:20:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:20:21,690][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:20:22,194][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:20:22,697][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:20:23,202][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:20:23,705][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:20:24,219][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:20:24,721][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:20:25,226][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:20:25,733][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:20:26,235][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:20:26,737][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:20:27,244][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:20:27,743][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:20:28,254][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:20:28,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:20:29,259][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:20:29,763][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:20:30,269][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:20:30,774][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:20:31,275][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:20:31,777][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:20:32,285][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:20:32,789][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:20:33,306][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:20:33,808][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:20:34,312][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:20:34,813][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10324 tokens. [2025-11-13 03:20:35,534][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 03:20:36,294][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:20:36,296][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:20:36,297][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:20:37,244][__main__][INFO] - Iteration 350 took 54s (32.81% Gen, 65.44% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 58m 19s. Estimated total time: 45h 9m 14s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 18s, 500 more iterations: 7h 31m 32s. [2025-11-13 03:20:37,246][__main__][INFO] - Starting iteration 350. [2025-11-13 03:20:37,721][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 34 and human policies 1. 
[2025-11-13 03:20:37,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:20:55,240][__main__][INFO] - Number of regex retries in iteration 350: 0 [2025-11-13 03:20:55,241][__main__][INFO] - agents played in iteration 350 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:20:56,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:20:56,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:20:56,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:20:56,228][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:20:56,229][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:20:56,230][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:20:56,982][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:20:57,449][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:20:57,962][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:20:58,474][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:20:58,981][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:20:59,492][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:20:59,998][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:21:00,502][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:21:01,016][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:21:01,516][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:21:02,025][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:21:02,533][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:21:03,040][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:21:03,547][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:21:04,049][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:21:04,552][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:21:05,055][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:21:05,560][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:21:06,070][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:21:06,575][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:21:07,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:21:07,588][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:21:08,092][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:21:08,595][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:21:09,101][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:21:09,608][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:21:10,118][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:21:10,624][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:21:11,134][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:21:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:21:12,161][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:21:12,667][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:21:13,173][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:21:13,679][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:21:14,187][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:21:14,690][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:21:15,193][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:21:15,699][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:21:16,204][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:21:16,705][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:21:17,222][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:21:17,724][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:21:18,241][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:21:18,747][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:21:19,257][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:21:19,760][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:21:20,264][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:21:20,770][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:21:21,279][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:21:21,781][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:21:22,284][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:21:22,789][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:21:23,298][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:21:23,801][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:21:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:21:24,806][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:21:25,314][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:21:25,815][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:21:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:21:26,837][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:21:27,346][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:21:27,854][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:21:28,353][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:21:28,861][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:21:29,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10299 tokens. [2025-11-13 03:21:30,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 03:21:30,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:21:30,865][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:21:30,867][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:21:32,777][__main__][INFO] - Iteration 351 took 55s (31.82% Gen, 64.71% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 40m 58s. Estimated total time: 45h 52m 49s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 45s, 500 more iterations: 7h 38m 48s. [2025-11-13 03:21:32,779][__main__][INFO] - Starting iteration 351. [2025-11-13 03:21:33,251][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 03:21:33,251][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:21:37,515][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:21:51,049][__main__][INFO] - Number of regex retries in iteration 351: 1 [2025-11-13 03:21:51,050][__main__][INFO] - agents played in iteration 351 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:21:51,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:21:51,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:21:51,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:21:52,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:21:52,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:21:52,013][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:21:52,761][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:21:53,226][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:21:53,737][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:21:54,248][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:21:54,764][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:21:55,266][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:21:55,773][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:21:56,281][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:21:56,789][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:21:57,292][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:21:57,801][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:21:58,307][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:21:58,816][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:21:59,322][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:21:59,828][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:22:00,333][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:22:00,840][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:22:01,358][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:22:01,860][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:22:02,365][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:22:02,869][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:22:03,375][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:22:03,881][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:22:04,383][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:22:04,885][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:22:05,393][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:22:05,899][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:22:06,410][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:22:06,915][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:22:07,416][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:22:07,921][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:22:08,425][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:22:08,933][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:22:09,440][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:22:09,944][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:22:10,461][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:22:10,965][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:22:11,467][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:22:11,970][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:22:12,471][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:22:12,975][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:22:13,476][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:22:13,981][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:22:14,489][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:22:14,989][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:22:15,489][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:22:15,992][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:22:16,495][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:22:16,996][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:22:17,500][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:22:18,005][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:22:18,509][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:22:19,009][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:22:19,514][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:22:20,028][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:22:20,531][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:22:21,050][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:22:21,553][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:22:22,054][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:22:22,565][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:22:23,064][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:22:23,566][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:22:24,066][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:22:24,571][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:22:25,074][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10304 tokens. [2025-11-13 03:22:25,783][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 03:22:26,561][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:22:26,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:22:26,564][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:22:27,479][__main__][INFO] - Iteration 352 took 54s (32.82% Gen, 65.49% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 58m 40s. Estimated total time: 45h 11m 26s. Time estimates for 10 more iterations: 9m 2s, 100 more iterations: 1h 30m 22s, 500 more iterations: 7h 31m 54s. [2025-11-13 03:22:27,482][__main__][INFO] - Starting iteration 352. [2025-11-13 03:22:27,953][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 03:22:27,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:22:32,945][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:22:34,503][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:22:46,702][__main__][INFO] - Number of regex retries in iteration 352: 2 [2025-11-13 03:22:46,703][__main__][INFO] - agents played in iteration 352 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:22:47,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:22:47,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:22:47,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:22:47,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:22:47,651][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:22:47,651][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:22:48,435][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:22:48,908][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:22:49,419][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:22:49,922][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:22:50,425][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:22:50,928][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:22:51,455][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:22:51,956][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:22:52,458][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:22:52,960][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:22:53,465][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:22:53,978][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:22:54,485][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:22:54,991][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:22:55,500][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:22:56,070][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:22:56,582][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:22:57,085][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:22:57,588][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:22:58,096][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:22:58,608][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:22:59,113][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:22:59,617][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:23:00,122][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:23:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:23:01,152][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:23:01,657][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:23:02,162][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:23:02,667][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:23:03,179][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:23:03,681][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:23:04,184][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:23:04,693][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:23:05,204][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:23:05,705][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:23:06,208][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:23:06,711][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:23:07,216][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:23:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:23:08,221][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:23:08,727][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:23:09,233][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:23:09,738][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:23:10,238][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:23:10,738][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:23:11,254][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:23:11,757][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:23:12,259][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:23:12,765][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:23:13,270][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:23:13,783][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:23:14,288][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:23:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:23:15,317][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:23:15,820][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:23:16,327][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:23:16,838][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:23:17,343][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:23:17,868][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:23:18,373][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:23:18,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:23:19,378][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:23:19,880][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:23:20,388][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:23:20,891][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10342 tokens. [2025-11-13 03:23:21,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 03:23:22,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:23:22,403][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:23:22,405][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:23:23,405][__main__][INFO] - Iteration 353 took 55s (33.81% Gen, 64.38% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 58m 59s. Estimated total time: 46h 12m 40s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 25s, 500 more iterations: 7h 42m 6s. [2025-11-13 03:23:23,408][__main__][INFO] - Starting iteration 353. [2025-11-13 03:23:23,891][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 03:23:23,891][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:23:42,799][__main__][INFO] - Number of regex retries in iteration 353: 0 [2025-11-13 03:23:42,799][__main__][INFO] - agents played in iteration 353 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:23:43,702][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:23:43,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:23:43,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:23:43,787][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:23:43,788][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:23:43,790][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:23:44,556][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:23:45,019][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:23:45,536][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:23:46,040][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:23:46,558][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:23:47,057][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:23:47,558][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:23:48,062][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:23:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:23:49,072][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:23:49,581][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:23:50,085][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:23:50,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:23:51,101][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:23:51,605][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:23:52,110][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:23:52,615][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:23:53,121][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:23:53,622][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:23:54,130][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:23:54,636][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:23:55,139][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:23:55,643][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:23:56,152][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:23:56,662][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:23:57,177][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:23:57,680][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:23:58,184][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:23:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:23:59,203][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:23:59,707][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:24:00,215][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:24:00,716][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:24:01,224][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:24:01,726][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:24:02,227][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:24:02,727][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:24:03,226][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:24:03,729][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:24:04,231][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:24:04,736][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:24:05,239][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:24:05,742][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:24:06,242][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:24:06,744][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:24:07,248][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:24:07,767][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:24:08,268][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:24:08,779][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:24:09,280][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:24:09,784][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:24:10,284][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:24:10,789][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:24:11,293][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:24:11,802][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:24:12,307][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:24:12,814][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:24:13,314][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:24:13,814][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:24:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:24:14,812][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:24:15,312][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:24:15,814][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:24:16,316][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:24:16,817][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10286 tokens. [2025-11-13 03:24:17,596][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 03:24:18,384][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:24:18,386][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:24:18,388][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:24:19,341][__main__][INFO] - Iteration 354 took 55s (34.10% Gen, 64.18% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 57m 53s. Estimated total time: 46h 12m 31s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 25s, 500 more iterations: 7h 42m 5s. [2025-11-13 03:24:19,343][__main__][INFO] - Starting iteration 354. [2025-11-13 03:24:19,845][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 03:24:19,846][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:24:38,657][__main__][INFO] - Number of regex retries in iteration 354: 0 [2025-11-13 03:24:38,658][__main__][INFO] - agents played in iteration 354 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:24:39,531][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:24:39,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:24:39,583][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:24:39,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:24:39,607][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:24:39,608][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:24:40,322][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:24:40,780][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:24:41,291][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:24:41,802][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:24:42,302][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:24:42,808][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:24:43,312][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:24:43,812][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:24:44,318][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:24:44,820][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:24:45,323][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:24:45,829][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:24:46,336][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:24:46,856][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:24:47,361][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:24:47,867][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:24:48,373][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:24:48,877][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:24:49,387][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:24:49,894][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:24:50,397][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:24:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:24:51,415][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:24:51,920][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:24:52,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:24:52,924][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:24:53,428][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:24:53,933][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:24:54,438][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:24:54,949][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:24:55,457][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:24:55,963][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:24:56,465][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:24:56,967][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:24:57,472][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:24:57,977][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:24:58,480][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:24:58,981][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:24:59,484][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:24:59,995][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:25:00,496][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:25:01,002][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:25:01,508][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:25:02,011][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:25:02,521][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:25:03,023][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:25:03,530][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:25:04,046][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:25:04,549][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:25:05,063][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:25:05,568][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:25:06,073][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:25:06,579][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:25:07,083][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:25:07,586][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:25:08,091][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:25:08,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:25:09,103][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:25:09,603][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:25:10,109][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:25:10,613][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:25:11,121][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:25:11,625][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:25:12,131][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:25:12,636][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10240 tokens. [2025-11-13 03:25:13,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:33 [2025-11-13 03:25:14,204][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:25:14,206][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:25:14,208][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:25:15,096][__main__][INFO] - Iteration 355 took 55s (34.05% Gen, 64.34% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 47m 1s. Estimated total time: 46h 2m 35s. Time estimates for 10 more iterations: 9m 12s, 100 more iterations: 1h 32m 5s, 500 more iterations: 7h 40m 25s. [2025-11-13 03:25:15,099][__main__][INFO] - Starting iteration 355. [2025-11-13 03:25:15,585][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 03:25:15,586][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:25:19,708][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:25:33,547][__main__][INFO] - Number of regex retries in iteration 355: 1 [2025-11-13 03:25:33,548][__main__][INFO] - agents played in iteration 355 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:25:34,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:25:34,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:25:34,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:25:34,445][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:25:34,446][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:25:34,447][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:25:35,136][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:25:35,759][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:25:36,265][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:25:36,787][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:25:37,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:25:37,799][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:25:38,305][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:25:38,817][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:25:39,326][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:25:39,831][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:25:40,340][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:25:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:25:41,357][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:25:41,869][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:25:42,375][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:25:42,878][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:25:43,388][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:25:43,896][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:25:44,402][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:25:44,907][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:25:45,412][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:25:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:25:46,424][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:25:46,927][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:25:47,432][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:25:47,938][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:25:48,441][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:25:48,946][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:25:49,447][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:25:49,951][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:25:50,457][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:25:50,960][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:25:51,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:25:51,966][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:25:52,484][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:25:52,987][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:25:53,489][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:25:53,993][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:25:54,497][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:25:55,004][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:25:55,505][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:25:56,010][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:25:56,518][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:25:57,023][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:25:57,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:25:58,027][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:25:58,535][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:25:59,041][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:25:59,546][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:26:00,048][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:26:00,552][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:26:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:26:01,570][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:26:02,074][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:26:02,581][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:26:03,087][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:26:03,592][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:26:04,100][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:26:04,604][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:26:05,105][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:26:05,614][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:26:06,118][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:26:06,624][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:26:07,131][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:26:07,634][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10450 tokens. [2025-11-13 03:26:08,411][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 03:26:09,169][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:26:09,171][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:26:09,172][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:26:10,111][__main__][INFO] - Iteration 356 took 54s (32.94% Gen, 65.34% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 9m 48s. Estimated total time: 45h 26m 17s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 52s, 500 more iterations: 7h 34m 22s. [2025-11-13 03:26:10,113][__main__][INFO] - Starting iteration 356. [2025-11-13 03:26:10,587][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 03:26:10,588][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:26:27,473][__main__][INFO] - Number of regex retries in iteration 356: 0 [2025-11-13 03:26:27,473][__main__][INFO] - agents played in iteration 356 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:26:28,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:26:28,283][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:26:28,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:26:28,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:26:28,328][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:26:28,329][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:26:29,028][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:26:29,490][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:26:30,001][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:26:30,517][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:26:31,021][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:26:31,532][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:26:32,039][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:26:32,545][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:26:33,054][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:26:33,556][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:26:34,061][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:26:34,577][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:26:35,083][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:26:35,606][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:26:36,117][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:26:36,622][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:26:37,133][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:26:37,636][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:26:38,139][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:26:38,643][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:26:39,152][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:26:39,658][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:26:40,159][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:26:40,666][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:26:41,184][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:26:41,694][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:26:42,200][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:26:42,704][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:26:43,210][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:26:43,716][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:26:44,216][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:26:44,725][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:26:45,230][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:26:45,730][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:26:46,235][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:26:46,736][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:26:47,237][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:26:47,737][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:26:48,237][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:26:48,739][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:26:49,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:26:49,746][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:26:50,254][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:26:50,757][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:26:51,261][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:26:51,766][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:26:52,275][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:26:52,805][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:26:53,316][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:26:53,820][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:26:54,322][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:26:54,828][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:26:55,335][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:26:55,840][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:26:56,341][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:26:56,849][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:26:57,352][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:26:57,859][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:26:58,363][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:26:58,867][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:26:59,371][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:26:59,879][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:27:00,383][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:27:00,888][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:27:01,393][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10342 tokens. [2025-11-13 03:27:02,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 03:27:02,908][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:27:02,909][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:27:02,912][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:27:03,867][__main__][INFO] - Iteration 357 took 53s (31.69% Gen, 66.51% Train). Generation: 16s, Training: 35s. Estimated remaining time: 39h 6m 38s. Estimated total time: 44h 24m 1s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 48s, 500 more iterations: 7h 24m 0s. [2025-11-13 03:27:03,869][__main__][INFO] - Starting iteration 357. [2025-11-13 03:27:04,337][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 03:27:04,338][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:27:16,885][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:27:23,393][__main__][INFO] - Number of regex retries in iteration 357: 1 [2025-11-13 03:27:23,394][__main__][INFO] - agents played in iteration 357 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:27:24,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:27:24,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:27:24,305][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:27:24,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:27:24,329][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:27:24,330][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:27:25,037][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:27:25,507][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:27:26,020][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:27:26,530][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:27:27,037][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:27:27,543][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:27:28,051][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:27:28,560][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:27:29,065][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:27:29,575][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:27:30,080][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:27:30,587][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:27:31,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:27:31,612][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:27:32,127][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:27:32,634][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:27:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:27:33,644][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:27:34,149][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:27:34,664][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:27:35,170][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:27:35,679][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:27:36,215][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:27:36,717][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:27:37,225][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:27:37,738][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:27:38,243][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:27:38,751][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:27:39,259][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:27:39,767][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:27:40,275][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:27:40,785][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:27:41,290][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:27:41,796][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:27:42,298][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:27:42,809][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:27:43,314][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:27:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:27:44,325][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:27:44,827][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:27:45,330][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:27:45,833][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:27:46,347][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:27:46,850][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:27:47,354][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:27:47,868][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:27:48,374][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:27:48,880][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:27:49,396][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:27:49,903][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:27:50,422][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:27:50,924][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:27:51,434][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:27:51,935][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:27:52,442][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:27:52,951][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:27:53,455][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:27:53,962][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:27:54,473][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:27:54,981][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:27:55,491][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:27:55,995][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:27:56,500][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:27:57,020][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:27:57,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10264 tokens. [2025-11-13 03:27:58,280][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 03:27:59,059][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:27:59,061][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:27:59,063][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:28:00,040][__main__][INFO] - Iteration 358 took 55s (34.21% Gen, 64.03% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 6m 49s. Estimated total time: 46h 25m 8s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 50s, 500 more iterations: 7h 44m 11s. [2025-11-13 03:28:00,042][__main__][INFO] - Starting iteration 358. [2025-11-13 03:28:00,523][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 03:28:00,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:28:12,679][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:28:18,962][__main__][INFO] - Number of regex retries in iteration 358: 1 [2025-11-13 03:28:18,963][__main__][INFO] - agents played in iteration 358 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:28:19,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:28:19,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:28:19,793][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:28:19,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:28:19,816][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:28:19,817][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:28:20,520][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:28:20,978][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:28:21,487][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:28:21,989][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:28:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:28:23,002][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:28:23,509][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:28:24,017][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:28:24,537][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:28:25,037][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:28:25,544][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:28:26,045][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:28:26,555][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:28:27,061][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:28:27,564][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:28:28,068][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:28:28,578][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:28:29,085][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:28:29,588][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:28:30,102][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:28:30,605][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:28:31,127][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:28:31,635][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:28:32,142][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:28:32,643][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:28:33,145][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:28:33,652][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:28:34,155][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:28:34,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:28:35,165][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:28:35,668][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:28:36,173][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:28:36,679][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:28:37,183][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:28:37,689][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:28:38,198][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:28:38,704][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:28:39,222][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:28:39,730][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:28:40,238][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:28:40,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:28:41,255][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:28:41,765][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:28:42,269][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:28:42,776][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:28:43,284][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:28:43,810][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:28:44,326][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:28:44,836][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:28:45,347][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:28:45,849][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:28:46,354][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:28:46,864][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:28:47,372][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:28:47,878][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:28:48,394][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:28:48,900][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:28:49,403][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:28:49,917][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:28:50,420][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:28:50,935][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:28:51,439][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:28:51,940][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:28:52,450][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:28:52,955][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10364 tokens. [2025-11-13 03:28:53,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 03:28:54,504][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:28:54,506][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:28:54,507][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:28:55,505][__main__][INFO] - Iteration 359 took 54s (33.54% Gen, 64.65% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 29m 53s. Estimated total time: 45h 49m 7s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 38s, 500 more iterations: 7h 38m 11s. [2025-11-13 03:28:55,507][__main__][INFO] - Starting iteration 359. [2025-11-13 03:28:56,124][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 03:28:56,124][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 03:29:13,555][__main__][INFO] - Number of regex retries in iteration 359: 0
[2025-11-13 03:29:13,556][__main__][INFO] - agents played in iteration 359 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 03:29:14,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:29:14,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:29:14,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:29:14,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:29:14,434][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 03:29:14,435][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 03:29:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:29:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:29:16,133][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:29:16,637][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:29:17,148][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:29:17,652][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:29:18,159][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:29:18,668][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:29:19,169][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:29:19,679][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:29:20,185][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:29:20,688][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:29:21,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:29:21,701][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:29:22,203][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:29:22,711][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:29:23,217][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:29:23,742][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:29:24,250][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:29:24,758][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:29:25,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:29:25,782][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:29:26,287][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:29:26,799][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:29:27,302][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:29:27,803][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:29:28,307][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:29:28,805][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:29:29,327][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:29:29,828][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:29:30,331][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:29:30,834][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:29:31,341][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:29:31,863][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:29:32,366][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:29:32,874][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:29:33,382][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:29:33,885][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:29:34,387][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:29:34,889][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:29:35,393][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:29:35,902][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:29:36,409][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:29:36,911][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:29:37,429][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:29:37,932][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:29:38,453][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:29:38,961][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:29:39,471][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:29:39,981][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:29:40,489][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:29:41,000][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:29:41,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:29:42,011][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:29:42,525][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:29:43,031][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:29:43,551][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:29:44,057][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:29:44,559][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:29:45,067][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:29:45,570][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:29:46,073][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:29:46,580][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:29:47,085][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:29:47,591][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10309 tokens. [2025-11-13 03:29:48,321][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 03:29:49,108][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:29:49,110][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:29:49,111][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:29:50,084][__main__][INFO] - Iteration 360 took 53s (32.30% Gen, 65.89% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 37m 52s. Estimated total time: 44h 58m 1s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 56s, 500 more iterations: 7h 29m 40s. [2025-11-13 03:29:50,086][__main__][INFO] - Starting iteration 360. [2025-11-13 03:29:50,564][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 35 and human policies 1. 
[2025-11-13 03:29:50,565][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 03:30:08,291][__main__][INFO] - Number of regex retries in iteration 360: 0
[2025-11-13 03:30:08,292][__main__][INFO] - agents played in iteration 360 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 03:30:09,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:30:09,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:30:09,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:30:09,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:30:09,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 03:30:09,176][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 03:30:09,921][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:30:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:30:10,897][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:30:11,400][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:30:11,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:30:12,408][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:30:12,914][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:30:13,417][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:30:13,920][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:30:14,424][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:30:14,930][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:30:15,435][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:30:15,947][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:30:16,449][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:30:16,955][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:30:17,458][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:30:17,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:30:18,473][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:30:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:30:19,479][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:30:19,981][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:30:20,485][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:30:20,990][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:30:21,492][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:30:21,995][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:30:22,500][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:30:23,005][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:30:23,510][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:30:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:30:24,514][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:30:25,016][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:30:25,516][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:30:26,017][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:30:26,521][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:30:27,024][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:30:27,551][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:30:28,054][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:30:28,559][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:30:29,068][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:30:29,576][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:30:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:30:30,586][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:30:31,090][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:30:31,593][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:30:32,098][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:30:32,602][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:30:33,106][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:30:33,612][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:30:34,121][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:30:34,626][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:30:35,132][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:30:35,637][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:30:36,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:30:36,642][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:30:37,151][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:30:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:30:38,166][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:30:38,671][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:30:39,175][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:30:39,679][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:30:40,179][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:30:40,682][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:30:41,182][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:30:41,683][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:30:42,188][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10211 tokens. [2025-11-13 03:30:42,923][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.04%, Current % of VRAM taken: 58.29%, Block Peak % of device VRAM: 62.01%, ΔTime: 00:00:33 [2025-11-13 03:30:43,703][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:30:43,705][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:30:43,707][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:30:45,572][__main__][INFO] - Iteration 361 took 55s (32.23% Gen, 64.38% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 29m 21s. Estimated total time: 45h 50m 25s. Time estimates for 10 more iterations: 9m 10s, 100 more iterations: 1h 31m 40s, 500 more iterations: 7h 38m 24s. [2025-11-13 03:30:45,575][__main__][INFO] - Starting iteration 361. [2025-11-13 03:30:46,064][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 03:30:46,065][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 03:30:58,099][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3
[2025-11-13 03:31:04,176][__main__][INFO] - Number of regex retries in iteration 361: 1
[2025-11-13 03:31:04,177][__main__][INFO] - agents played in iteration 361 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 03:31:04,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:31:05,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:31:05,035][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:31:05,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:31:05,059][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 03:31:05,060][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 03:31:05,820][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:31:06,284][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:31:06,794][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:31:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:31:07,809][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:31:08,313][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:31:08,818][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:31:09,327][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:31:09,832][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:31:10,337][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:31:10,841][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:31:11,360][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:31:11,865][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:31:12,376][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:31:12,878][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:31:13,381][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:31:13,882][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:31:14,386][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:31:14,893][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:31:15,396][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:31:15,902][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:31:16,430][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:31:16,935][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:31:17,437][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:31:17,956][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:31:18,461][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:31:18,961][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:31:19,464][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:31:19,972][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:31:20,475][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:31:20,977][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:31:21,482][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:31:21,987][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:31:22,494][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:31:23,003][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:31:23,508][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:31:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:31:24,519][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:31:25,024][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:31:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:31:26,039][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:31:26,547][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:31:27,056][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:31:27,565][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:31:28,071][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:31:28,577][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:31:29,082][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:31:29,591][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:31:30,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:31:30,598][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:31:31,106][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:31:31,608][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:31:32,114][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:31:32,614][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:31:33,118][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:31:33,623][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:31:34,129][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:31:34,630][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:31:35,133][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:31:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:31:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:31:36,637][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:31:37,136][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:31:37,654][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:31:38,154][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10292 tokens. [2025-11-13 03:31:38,943][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33 [2025-11-13 03:31:39,718][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:31:39,719][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:31:39,721][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:31:40,697][__main__][INFO] - Iteration 362 took 54s (33.15% Gen, 65.06% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 9m 42s. Estimated total time: 45h 31m 41s. Time estimates for 10 more iterations: 9m 6s, 100 more iterations: 1h 31m 3s, 500 more iterations: 7h 35m 16s. [2025-11-13 03:31:40,699][__main__][INFO] - Starting iteration 362. [2025-11-13 03:31:41,171][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 03:31:41,172][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 03:31:46,345][mllm.models.large_language_model_local][WARNING] - Response .Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3
[2025-11-13 03:31:56,115][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3
[2025-11-13 03:31:57,901][__main__][INFO] - Number of regex retries in iteration 362: 2
[2025-11-13 03:31:57,902][__main__][INFO] - agents played in iteration 362 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 03:31:58,731][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:31:58,759][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:31:58,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:31:58,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 03:31:58,810][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 03:31:58,810][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 03:31:59,528][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:31:59,992][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:32:00,505][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:32:01,009][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:32:01,522][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:32:02,031][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:32:02,541][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:32:03,045][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:32:03,551][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:32:04,063][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:32:04,568][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:32:05,090][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:32:05,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:32:06,094][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:32:06,601][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:32:07,102][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:32:07,610][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:32:08,111][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:32:08,616][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:32:09,122][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:32:09,628][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:32:10,129][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:32:10,635][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:32:11,136][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:32:11,641][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:32:12,144][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:32:12,647][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:32:13,170][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:32:13,677][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:32:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:32:14,693][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:32:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:32:15,735][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:32:16,241][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:32:16,745][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:32:17,250][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:32:17,755][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:32:18,265][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:32:18,770][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:32:19,277][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:32:19,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:32:20,297][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:32:20,804][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:32:21,306][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:32:21,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:32:22,321][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:32:22,823][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:32:23,325][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:32:23,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:32:24,333][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:32:24,839][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:32:25,343][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:32:25,842][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:32:26,351][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:32:26,856][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:32:27,356][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:32:27,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:32:28,365][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:32:28,873][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:32:29,379][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:32:29,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:32:30,386][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:32:30,894][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:32:31,410][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:32:31,915][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10341 tokens. [2025-11-13 03:32:32,647][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 03:32:33,427][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:32:33,429][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:32:33,431][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:32:34,403][__main__][INFO] - Iteration 363 took 53s (31.43% Gen, 66.74% Train). Generation: 16s, Training: 35s. Estimated remaining time: 38h 58m 44s. Estimated total time: 44h 21m 37s. Time estimates for 10 more iterations: 8m 52s, 100 more iterations: 1h 28m 43s, 500 more iterations: 7h 23m 36s. [2025-11-13 03:32:34,405][__main__][INFO] - Starting iteration 363. [2025-11-13 03:32:34,887][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 03:32:34,887][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:32:54,094][__main__][INFO] - Number of regex retries in iteration 363: 0 [2025-11-13 03:32:54,095][__main__][INFO] - agents played in iteration 363 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:32:54,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:32:54,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:32:55,008][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:32:55,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:32:55,034][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:32:55,035][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:32:55,817][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:32:56,285][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:32:56,796][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:32:57,305][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:32:57,810][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:32:58,317][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:32:58,826][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:32:59,332][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:32:59,835][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:33:00,338][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:33:00,842][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:33:01,350][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:33:01,851][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:33:02,354][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:33:02,858][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:33:03,362][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:33:03,862][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:33:04,366][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:33:04,870][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:33:05,384][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:33:05,889][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:33:06,393][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:33:06,905][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:33:07,405][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:33:07,915][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:33:08,416][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:33:08,922][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:33:09,430][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:33:09,936][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:33:10,442][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:33:10,944][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:33:11,447][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:33:11,953][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:33:12,461][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:33:12,969][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:33:13,489][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:33:13,998][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:33:14,514][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:33:15,021][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:33:15,528][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:33:16,035][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:33:16,538][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:33:17,043][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:33:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:33:18,044][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:33:18,545][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:33:19,052][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:33:19,554][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:33:20,060][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:33:20,564][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:33:21,068][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:33:21,573][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:33:22,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:33:22,583][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:33:23,084][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:33:23,590][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:33:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:33:24,596][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:33:25,111][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:33:25,613][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:33:26,116][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:33:26,623][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:33:27,126][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:33:27,632][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:33:28,139][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10359 tokens. [2025-11-13 03:33:28,898][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 03:33:29,672][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:33:29,674][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:33:29,676][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:33:30,637][__main__][INFO] - Iteration 364 took 55s (34.45% Gen, 63.82% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 3m 43s. Estimated total time: 46h 27m 32s. Time estimates for 10 more iterations: 9m 17s, 100 more iterations: 1h 32m 55s, 500 more iterations: 7h 44m 35s. [2025-11-13 03:33:30,639][__main__][INFO] - Starting iteration 364. [2025-11-13 03:33:31,113][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 03:33:31,114][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:33:49,714][__main__][INFO] - Number of regex retries in iteration 364: 0 [2025-11-13 03:33:49,714][__main__][INFO] - agents played in iteration 364 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:33:50,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:33:50,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:33:50,618][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:33:50,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:33:50,642][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:33:50,643][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:33:51,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:33:51,891][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:33:52,400][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:33:52,908][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:33:53,427][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:33:53,937][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:33:54,441][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:33:54,944][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:33:55,450][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:33:55,954][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:33:56,460][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:33:56,960][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:33:57,464][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:33:57,974][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:33:58,482][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:33:58,987][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:33:59,500][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:34:00,001][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:34:00,514][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:34:01,017][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:34:01,527][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:34:02,035][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:34:02,538][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:34:03,039][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:34:03,546][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:34:04,047][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:34:04,552][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:34:05,069][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:34:05,578][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:34:06,099][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:34:06,607][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:34:07,123][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:34:07,624][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:34:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:34:08,637][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:34:09,141][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:34:09,644][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:34:10,156][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:34:10,661][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:34:11,171][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:34:11,671][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:34:12,173][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:34:12,679][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:34:13,183][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:34:13,682][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:34:14,183][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:34:14,684][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:34:15,191][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:34:15,693][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:34:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:34:16,707][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:34:17,211][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:34:17,735][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:34:18,236][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:34:18,740][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:34:19,240][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:34:19,741][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:34:20,242][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:34:20,740][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:34:21,239][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:34:21,745][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:34:22,243][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:34:22,742][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:34:23,251][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:34:23,753][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10313 tokens. [2025-11-13 03:34:24,484][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.07%, ΔTime: 00:00:33 [2025-11-13 03:34:25,252][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:34:25,254][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:34:25,255][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:34:26,230][__main__][INFO] - Iteration 365 took 55s (33.75% Gen, 64.48% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 31m 10s. Estimated total time: 45h 55m 55s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 51s, 500 more iterations: 7h 39m 19s. [2025-11-13 03:34:26,233][__main__][INFO] - Starting iteration 365. [2025-11-13 03:34:26,698][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 03:34:26,699][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:34:32,661][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:34:44,510][__main__][INFO] - Number of regex retries in iteration 365: 1 [2025-11-13 03:34:44,510][__main__][INFO] - agents played in iteration 365 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:34:45,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:34:45,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:34:45,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:34:45,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:34:45,482][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:34:45,482][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:34:46,232][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:34:46,708][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:34:47,219][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:34:47,728][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:34:48,236][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:34:48,740][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:34:49,246][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:34:49,754][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:34:50,260][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:34:50,766][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:34:51,267][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:34:51,767][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:34:52,270][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:34:52,775][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:34:53,280][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:34:53,787][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:34:54,293][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:34:54,809][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:34:55,315][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:34:55,831][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:34:56,338][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:34:56,839][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:34:57,350][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:34:57,855][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:34:58,360][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:34:58,866][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:34:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:34:59,885][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:35:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:35:00,891][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:35:01,396][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:35:01,899][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:35:02,406][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:35:02,908][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:35:03,414][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:35:03,923][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:35:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:35:04,940][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:35:05,446][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:35:05,953][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:35:06,458][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:35:06,959][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:35:07,463][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:35:07,968][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:35:08,471][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:35:08,980][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:35:09,482][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:35:09,984][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:35:10,485][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:35:10,988][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:35:11,492][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:35:11,999][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:35:12,500][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:35:13,002][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:35:13,506][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:35:14,006][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:35:14,507][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:35:15,010][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:35:15,533][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:35:16,036][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:35:16,539][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:35:17,041][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:35:17,542][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:35:18,049][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:35:18,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10315 tokens. [2025-11-13 03:35:19,272][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 03:35:20,036][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:35:20,038][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:35:20,040][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:35:21,047][__main__][INFO] - Iteration 366 took 54s (32.77% Gen, 65.37% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 51m 49s. Estimated total time: 45h 17m 28s. Time estimates for 10 more iterations: 9m 3s, 100 more iterations: 1h 30m 34s, 500 more iterations: 7h 32m 54s. [2025-11-13 03:35:21,049][__main__][INFO] - Starting iteration 366. [2025-11-13 03:35:21,515][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 03:35:21,516][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:35:38,392][__main__][INFO] - Number of regex retries in iteration 366: 0 [2025-11-13 03:35:38,392][__main__][INFO] - agents played in iteration 366 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:35:39,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:35:39,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:35:39,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:35:39,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:35:39,308][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:35:39,309][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:35:40,094][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:35:40,562][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:35:41,080][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:35:41,586][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:35:42,094][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:35:42,596][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:35:43,102][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:35:43,603][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:35:44,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:35:44,614][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:35:45,124][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:35:45,628][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:35:46,138][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:35:46,642][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:35:47,158][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:35:47,660][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:35:48,164][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:35:48,673][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:35:49,180][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:35:49,692][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:35:50,205][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:35:50,713][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:35:51,240][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:35:51,747][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:35:52,257][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:35:52,760][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:35:53,265][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:35:53,771][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:35:54,274][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:35:54,777][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:35:55,282][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:35:55,785][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:35:56,292][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:35:56,797][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:35:57,298][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:35:57,812][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:35:58,313][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:35:58,832][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:35:59,335][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:35:59,839][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:36:00,340][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:36:00,844][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:36:01,351][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:36:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:36:02,360][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:36:02,863][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:36:03,367][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:36:03,871][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:36:04,376][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:36:04,882][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:36:05,388][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:36:05,900][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:36:06,402][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:36:06,908][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:36:07,412][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:36:07,928][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:36:08,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:36:08,937][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:36:09,452][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:36:09,953][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:36:10,457][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:36:10,958][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:36:11,465][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:36:11,970][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:36:12,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10278 tokens. [2025-11-13 03:36:13,229][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 03:36:14,022][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:36:14,024][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:36:14,026][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:36:15,062][__main__][INFO] - Iteration 367 took 53s (31.52% Gen, 66.55% Train). Generation: 16s, Training: 35s. Estimated remaining time: 39h 10m 49s. Estimated total time: 44h 37m 22s. Time estimates for 10 more iterations: 8m 55s, 100 more iterations: 1h 29m 14s, 500 more iterations: 7h 26m 13s. [2025-11-13 03:36:15,066][__main__][INFO] - Starting iteration 367. [2025-11-13 03:36:15,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 03:36:15,577][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:36:34,083][__main__][INFO] - Number of regex retries in iteration 367: 0 [2025-11-13 03:36:34,084][__main__][INFO] - agents played in iteration 367 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:36:34,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:36:35,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:36:35,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:36:35,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:36:35,061][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:36:35,061][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:36:35,807][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:36:36,266][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:36:36,779][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:36:37,282][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:36:37,786][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:36:38,290][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:36:38,799][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:36:39,312][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:36:39,817][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:36:40,325][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:36:40,834][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:36:41,340][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:36:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:36:42,347][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:36:42,852][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:36:43,361][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:36:43,866][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:36:44,369][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:36:44,873][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:36:45,382][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:36:45,886][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:36:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:36:46,902][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:36:47,410][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:36:47,917][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:36:48,420][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:36:48,922][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:36:49,422][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:36:49,937][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:36:50,438][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:36:50,942][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:36:51,441][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:36:51,946][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:36:52,457][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:36:52,962][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:36:53,467][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:36:53,969][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:36:54,473][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:36:54,978][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:36:55,484][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:36:55,991][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:36:56,499][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:36:57,001][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:36:57,503][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:36:58,018][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:36:58,520][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:36:59,025][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:36:59,530][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:37:00,036][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:37:00,558][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:37:01,062][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:37:01,567][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:37:02,071][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:37:02,576][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:37:03,082][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:37:03,586][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:37:04,093][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:37:04,599][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:37:05,099][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:37:05,600][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:37:06,099][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:37:06,603][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:37:07,106][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:37:07,612][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:37:08,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10294 tokens. [2025-11-13 03:37:08,896][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.03%, Current % of VRAM taken: 58.28%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 03:37:09,712][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:37:09,714][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:37:09,716][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:37:10,573][__main__][INFO] - Iteration 368 took 54s (33.65% Gen, 64.79% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 22m 23s. Estimated total time: 45h 49m 52s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 39s, 500 more iterations: 7h 38m 18s. [2025-11-13 03:37:10,575][__main__][INFO] - Starting iteration 368. [2025-11-13 03:37:11,055][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 03:37:11,055][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:37:28,528][__main__][INFO] - Number of regex retries in iteration 368: 0 [2025-11-13 03:37:28,528][__main__][INFO] - agents played in iteration 368 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:37:29,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:37:29,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:37:29,495][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:37:29,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:37:29,519][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:37:29,520][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:37:30,279][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:37:30,754][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:37:31,267][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:37:31,773][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:37:32,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:37:32,778][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:37:33,281][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:37:33,790][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:37:34,293][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:37:34,801][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:37:35,306][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:37:35,810][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:37:36,316][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:37:36,821][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:37:37,324][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:37:37,829][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:37:38,335][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:37:38,841][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:37:39,351][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:37:39,866][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:37:40,372][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:37:40,877][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:37:41,382][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:37:41,889][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:37:42,392][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:37:42,920][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:37:43,424][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:37:43,931][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:37:44,435][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:37:44,938][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:37:45,458][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:37:45,961][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:37:46,464][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:37:46,973][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:37:47,477][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:37:47,982][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:37:48,483][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:37:48,987][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:37:49,491][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:37:49,996][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:37:50,498][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:37:51,005][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:37:51,513][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:37:52,017][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:37:52,522][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:37:53,028][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:37:53,538][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:37:54,047][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:37:54,561][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:37:55,070][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:37:55,574][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:37:56,084][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:37:56,585][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:37:57,093][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:37:57,595][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:37:58,101][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:37:58,604][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:37:59,109][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:37:59,613][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:38:00,113][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:38:00,614][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:38:01,137][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:38:01,645][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:38:02,152][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:38:02,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10275 tokens. [2025-11-13 03:38:03,403][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 03:38:04,185][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:38:04,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:38:04,189][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:38:05,147][__main__][INFO] - Iteration 369 took 54s (32.30% Gen, 65.92% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 36m 16s. Estimated total time: 45h 4m 39s. Time estimates for 10 more iterations: 9m 0s, 100 more iterations: 1h 30m 9s, 500 more iterations: 7h 30m 46s. [2025-11-13 03:38:05,149][__main__][INFO] - Starting iteration 369. [2025-11-13 03:38:05,637][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 03:38:05,637][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:38:11,846][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:38:15,712][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:38:25,541][__main__][INFO] - Number of regex retries in iteration 369: 2 [2025-11-13 03:38:25,542][__main__][INFO] - agents played in iteration 369 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:38:26,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:38:26,459][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:38:26,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:38:26,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:38:26,515][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:38:26,516][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:38:27,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:38:27,771][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:38:28,280][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:38:28,785][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:38:29,290][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:38:29,798][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:38:30,302][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:38:30,805][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:38:31,316][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:38:31,827][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:38:32,337][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:38:32,841][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:38:33,344][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:38:33,854][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:38:34,356][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:38:34,857][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:38:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:38:35,862][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:38:36,376][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:38:36,901][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:38:37,406][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:38:37,908][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:38:38,413][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:38:38,921][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:38:39,422][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:38:39,922][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:38:40,438][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:38:40,939][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:38:41,440][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:38:41,943][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:38:42,444][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:38:42,964][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:38:43,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:38:43,975][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:38:44,482][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:38:44,984][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:38:45,489][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:38:45,997][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:38:46,502][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:38:47,004][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:38:47,505][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:38:48,012][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:38:48,521][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:38:49,026][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:38:49,547][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:38:50,051][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:38:50,554][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:38:51,055][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:38:51,558][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:38:52,057][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:38:52,557][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:38:53,057][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:38:53,561][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:38:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:38:54,559][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:38:55,063][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:38:55,565][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:38:56,076][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:38:56,578][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:38:57,087][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:38:57,594][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:38:58,102][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:38:58,608][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:38:59,116][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:38:59,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10229 tokens. [2025-11-13 03:39:00,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.14%, ΔTime: 00:00:33 [2025-11-13 03:39:01,205][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:39:01,207][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:39:01,208][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:39:02,105][__main__][INFO] - Iteration 370 took 56s (35.25% Gen, 63.16% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 34m 6s. Estimated total time: 47h 3m 26s. Time estimates for 10 more iterations: 9m 24s, 100 more iterations: 1h 34m 6s, 500 more iterations: 7h 50m 34s. [2025-11-13 03:39:02,108][__main__][INFO] - Starting iteration 370. [2025-11-13 03:39:02,589][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 36 and human policies 1. 
[2025-11-13 03:39:02,590][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:39:09,383][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:39:12,151][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:39:22,359][__main__][INFO] - Number of regex retries in iteration 370: 2 [2025-11-13 03:39:22,359][__main__][INFO] - agents played in iteration 370 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:39:23,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:39:23,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:39:23,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:39:23,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:39:23,280][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:39:23,281][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:39:24,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:39:24,515][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:39:25,050][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:39:25,555][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:39:26,062][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:39:26,568][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:39:27,072][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:39:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:39:28,093][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:39:28,598][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:39:29,103][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:39:29,618][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:39:30,122][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:39:30,628][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:39:31,131][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:39:31,638][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:39:32,143][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:39:32,649][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:39:33,155][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:39:33,655][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:39:34,171][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:39:34,677][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:39:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:39:35,684][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:39:36,185][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:39:36,697][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:39:37,200][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:39:37,702][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:39:38,210][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:39:38,715][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:39:39,218][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:39:39,729][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:39:40,232][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:39:40,757][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:39:41,260][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:39:41,764][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:39:42,269][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:39:42,775][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:39:43,284][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:39:43,792][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:39:44,295][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:39:44,796][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:39:45,295][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:39:45,799][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:39:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:39:46,802][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:39:47,301][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:39:47,800][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:39:48,300][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:39:48,799][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:39:49,299][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:39:49,800][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:39:50,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:39:50,804][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:39:51,304][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:39:51,803][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:39:52,303][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:39:52,802][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:39:53,304][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:39:53,812][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:39:54,315][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:39:54,821][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:39:55,327][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:39:55,830][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:39:56,357][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10273 tokens. [2025-11-13 03:39:57,175][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.03%, Current % of VRAM taken: 58.28%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 03:39:57,950][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:39:57,952][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:39:57,954][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:39:59,697][__main__][INFO] - Iteration 371 took 57s (34.62% Gen, 62.33% Train). Generation: 19s, Training: 35s. Estimated remaining time: 42h 5m 8s. Estimated total time: 47h 35m 26s. Time estimates for 10 more iterations: 9m 31s, 100 more iterations: 1h 35m 10s, 500 more iterations: 7h 55m 54s. [2025-11-13 03:39:59,699][__main__][INFO] - Starting iteration 371. [2025-11-13 03:40:00,179][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 03:40:00,179][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:40:04,688][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:40:04,770][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:40:09,686][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:40:17,752][__main__][INFO] - Number of regex retries in iteration 371: 3 [2025-11-13 03:40:17,752][__main__][INFO] - agents played in iteration 371 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:40:18,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:40:18,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:40:18,689][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:40:18,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:40:18,712][mllm.training.trainer_ad_align][INFO] - 
Sharing advantage alignment data. [2025-11-13 03:40:18,713][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 03:40:19,497][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:40:19,964][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:40:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:40:20,981][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:40:21,489][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:40:22,001][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:40:22,505][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:40:23,009][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:40:23,516][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:40:24,018][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:40:24,523][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:40:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:40:25,534][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:40:26,042][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:40:26,560][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:40:27,064][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:40:27,568][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:40:28,083][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:40:28,589][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:40:29,121][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-13 03:40:29,627][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 03:40:30,135][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:40:30,641][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:40:31,145][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:40:31,650][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:40:32,152][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:40:32,656][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:40:33,176][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:40:33,684][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:40:34,190][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:40:34,697][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:40:35,202][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:40:35,714][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:40:36,220][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:40:36,725][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:40:37,236][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:40:37,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:40:38,247][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:40:38,752][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:40:39,253][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:40:39,759][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-13 03:40:40,263][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 03:40:40,766][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:40:41,265][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:40:41,765][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:40:42,277][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:40:42,777][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:40:43,278][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:40:43,781][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:40:44,284][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:40:44,784][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:40:45,291][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:40:45,793][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:40:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:40:46,804][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:40:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:40:47,816][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:40:48,317][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:40:48,817][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:40:49,321][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:40:49,826][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:40:50,328][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-13 03:40:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 03:40:51,333][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:40:51,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10289 tokens. [2025-11-13 03:40:52,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 03:40:53,404][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:40:53,407][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:40:53,409][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:40:54,362][__main__][INFO] - Iteration 372 took 54s (32.43% Gen, 65.81% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 37m 59s. Estimated total time: 45h 9m 11s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 18s, 500 more iterations: 7h 31m 31s. [2025-11-13 03:40:54,364][__main__][INFO] - Starting iteration 372. [2025-11-13 03:40:54,829][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 03:40:54,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:41:13,870][__main__][INFO] - Number of regex retries in iteration 372: 0 [2025-11-13 03:41:13,871][__main__][INFO] - agents played in iteration 372 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:41:14,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:41:14,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:41:14,752][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:41:14,774][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:41:14,775][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:41:14,776][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:41:15,520][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:41:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:41:16,490][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:41:16,992][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:41:17,502][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:41:18,010][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:41:18,512][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:41:19,018][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:41:19,523][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:41:20,023][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:41:20,525][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:41:21,033][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:41:21,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:41:22,041][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:41:22,546][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:41:23,046][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:41:23,550][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:41:24,050][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:41:24,557][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:41:25,057][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:41:25,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:41:26,075][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:41:26,578][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:41:27,092][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:41:27,595][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:41:28,095][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:41:28,603][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:41:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:41:29,610][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:41:30,108][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:41:30,611][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:41:31,111][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:41:31,613][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:41:32,122][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:41:32,647][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:41:33,155][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:41:33,657][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:41:34,160][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:41:34,667][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:41:35,173][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:41:35,676][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:41:36,177][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:41:36,679][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:41:37,182][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:41:37,693][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:41:38,192][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:41:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:41:39,197][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:41:39,702][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:41:40,204][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:41:40,710][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:41:41,213][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:41:41,719][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:41:42,225][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:41:42,726][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:41:43,231][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:41:43,739][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:41:44,241][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:41:44,745][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:41:45,254][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:41:45,759][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:41:46,260][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:41:46,763][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:41:47,269][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:41:47,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10264 tokens. [2025-11-13 03:41:48,545][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.46%, ΔTime: 00:00:33 [2025-11-13 03:41:49,328][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:41:49,330][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:41:49,332][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:41:50,247][__main__][INFO] - Iteration 373 took 55s (34.36% Gen, 63.99% Train). Generation: 19s, Training: 35s. Estimated remaining time: 40h 38m 47s. Estimated total time: 46h 10m 55s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 21s, 500 more iterations: 7h 41m 49s. [2025-11-13 03:41:50,249][__main__][INFO] - Starting iteration 373. [2025-11-13 03:41:50,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 03:41:50,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:41:56,039][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:42:07,961][__main__][INFO] - Number of regex retries in iteration 373: 1 [2025-11-13 03:42:07,962][__main__][INFO] - agents played in iteration 373 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:42:08,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:42:08,781][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:42:08,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:42:08,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:42:08,826][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:42:08,827][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:42:09,576][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:42:10,037][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:42:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:42:11,046][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:42:11,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:42:12,060][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:42:12,572][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:42:13,079][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:42:13,588][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:42:14,093][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:42:14,606][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:42:15,109][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:42:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:42:16,123][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:42:16,629][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:42:17,131][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:42:17,638][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:42:18,141][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:42:18,657][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:42:19,160][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:42:19,663][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:42:20,182][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:42:20,685][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:42:21,201][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:42:21,702][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:42:22,211][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:42:22,718][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:42:23,219][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:42:23,723][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:42:24,226][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:42:24,734][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:42:25,241][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:42:25,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:42:26,248][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:42:26,752][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:42:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:42:27,760][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:42:28,264][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:42:28,768][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:42:29,279][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:42:29,780][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:42:30,282][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:42:30,792][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:42:31,297][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:42:31,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:42:32,315][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:42:32,819][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:42:33,323][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:42:33,827][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:42:34,328][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:42:34,833][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:42:35,336][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:42:35,845][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:42:36,353][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:42:36,856][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:42:37,363][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:42:37,870][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:42:38,389][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:42:38,895][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:42:39,397][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:42:39,900][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:42:40,403][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:42:40,912][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:42:41,418][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:42:41,919][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10380 tokens. [2025-11-13 03:42:42,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:33 [2025-11-13 03:42:43,467][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:42:43,469][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:42:43,471][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:42:44,447][__main__][INFO] - Iteration 374 took 53s (32.07% Gen, 66.11% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 12m 27s. Estimated total time: 44h 45m 30s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 31s, 500 more iterations: 7h 27m 35s. [2025-11-13 03:42:44,449][__main__][INFO] - Starting iteration 374. [2025-11-13 03:42:44,939][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 03:42:44,940][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:42:49,466][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:43:02,170][__main__][INFO] - Number of regex retries in iteration 374: 1 [2025-11-13 03:43:02,170][__main__][INFO] - agents played in iteration 374 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:43:03,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:43:03,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:43:03,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:43:03,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:43:03,082][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:43:03,082][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:43:03,849][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:43:04,308][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:43:04,817][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:43:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:43:05,826][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:43:06,328][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:43:06,831][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:43:07,339][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:43:07,843][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:43:08,345][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:43:08,847][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:43:09,352][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:43:09,864][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:43:10,366][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:43:10,877][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:43:11,379][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:43:11,882][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:43:12,384][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:43:12,890][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:43:13,393][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:43:13,894][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:43:14,397][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:43:14,912][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:43:15,416][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:43:15,922][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:43:16,428][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:43:16,940][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:43:17,452][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:43:17,957][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:43:18,462][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:43:18,985][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:43:19,496][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:43:20,002][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:43:20,503][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:43:21,003][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:43:21,512][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:43:22,011][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:43:22,516][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:43:23,015][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:43:23,515][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:43:24,017][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:43:24,520][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:43:25,018][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:43:25,521][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:43:26,021][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:43:26,523][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:43:27,025][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:43:27,525][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:43:28,025][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:43:28,528][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:43:29,030][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:43:29,536][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:43:30,040][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:43:30,546][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:43:31,051][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:43:31,556][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:43:32,061][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:43:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:43:33,067][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:43:33,573][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:43:34,078][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:43:34,585][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:43:35,087][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:43:35,592][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:43:36,110][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10160 tokens. [2025-11-13 03:43:36,862][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 03:43:37,643][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:43:37,645][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:43:37,646][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:43:38,581][__main__][INFO] - Iteration 375 took 53s (32.12% Gen, 66.13% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 8m 10s. Estimated total time: 44h 42m 7s. Time estimates for 10 more iterations: 8m 56s, 100 more iterations: 1h 29m 24s, 500 more iterations: 7h 27m 1s. [2025-11-13 03:43:38,583][__main__][INFO] - Starting iteration 375. [2025-11-13 03:43:39,078][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 03:43:39,078][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:43:43,604][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:43:56,339][__main__][INFO] - Number of regex retries in iteration 375: 1 [2025-11-13 03:43:56,339][__main__][INFO] - agents played in iteration 375 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:43:57,191][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:43:57,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:43:57,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:43:57,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:43:57,260][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:43:57,261][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:43:58,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:43:58,501][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:43:59,010][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:43:59,530][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:44:00,041][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:44:00,543][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:44:01,045][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:44:01,549][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:44:02,052][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:44:02,564][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:44:03,072][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:44:03,578][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:44:04,081][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:44:04,582][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:44:05,094][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:44:05,601][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:44:06,106][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:44:06,608][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:44:07,110][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:44:07,615][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:44:08,143][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:44:08,649][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:44:09,153][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:44:09,658][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:44:10,164][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:44:10,664][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:44:11,164][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:44:11,665][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:44:12,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:44:12,672][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:44:13,177][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:44:13,679][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:44:14,186][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:44:14,695][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:44:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:44:15,699][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:44:16,199][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:44:16,701][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:44:17,201][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:44:17,704][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:44:18,202][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:44:18,705][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:44:19,215][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:44:19,720][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:44:20,220][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:44:20,727][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:44:21,230][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:44:21,740][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:44:22,247][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:44:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:44:23,259][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:44:23,763][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:44:24,271][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:44:24,776][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:44:25,283][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:44:25,786][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:44:26,300][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:44:26,806][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:44:27,323][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:44:27,827][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:44:28,341][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:44:28,845][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:44:29,348][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:44:29,851][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:44:30,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10251 tokens. [2025-11-13 03:44:31,113][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 03:44:31,884][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:44:31,886][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:44:31,887][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:44:32,903][__main__][INFO] - Iteration 376 took 53s (32.07% Gen, 66.04% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 16m 26s. Estimated total time: 44h 51m 17s. Time estimates for 10 more iterations: 8m 58s, 100 more iterations: 1h 29m 42s, 500 more iterations: 7h 28m 32s. [2025-11-13 03:44:32,905][__main__][INFO] - Starting iteration 376. [2025-11-13 03:44:33,371][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 03:44:33,371][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:44:37,791][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:44:37,855][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:44:37,970][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:44:49,217][__main__][INFO] - Number of regex retries in iteration 376: 3 [2025-11-13 03:44:49,218][__main__][INFO] - agents played in iteration 376 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:44:50,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:44:50,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:44:50,069][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:44:50,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:44:50,093][mllm.training.trainer_ad_align][INFO] - 
Sharing advantage alignment data. [2025-11-13 03:44:50,093][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 03:44:50,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:44:51,320][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:44:51,833][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:44:52,342][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:44:52,846][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:44:53,363][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:44:53,873][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:44:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:44:54,884][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:44:55,391][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:44:55,892][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:44:56,395][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:44:56,898][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:44:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:44:57,902][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:44:58,402][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:44:58,905][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:44:59,404][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:44:59,909][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:45:00,409][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-13 03:45:00,909][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 03:45:01,419][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:45:01,922][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:45:02,422][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:45:02,925][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:45:03,426][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:45:03,930][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:45:04,433][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:45:04,935][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:45:05,454][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:45:05,958][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:45:06,461][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:45:06,974][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:45:07,478][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:45:07,985][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:45:08,486][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:45:08,992][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:45:09,495][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:45:09,997][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:45:10,501][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:45:11,004][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-13 03:45:11,506][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 03:45:12,013][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:45:12,514][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:45:13,015][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:45:13,530][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:45:14,033][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:45:14,544][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:45:15,043][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:45:15,545][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:45:16,048][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:45:16,550][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:45:17,055][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:45:17,557][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:45:18,059][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:45:18,565][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:45:19,066][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:45:19,571][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:45:20,076][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:45:20,580][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:45:21,087][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:45:21,590][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-13 03:45:22,094][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 03:45:22,606][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:45:23,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10161 tokens. [2025-11-13 03:45:23,869][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 03:45:24,640][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:45:24,641][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:45:24,643][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:45:25,593][__main__][INFO] - Iteration 377 took 52s (30.34% Gen, 67.83% Train). Generation: 15s, Training: 35s. Estimated remaining time: 37h 55m 23s. Estimated total time: 43h 31m 7s. Time estimates for 10 more iterations: 8m 42s, 100 more iterations: 1h 27m 2s, 500 more iterations: 7h 15m 11s. [2025-11-13 03:45:25,595][__main__][INFO] - Starting iteration 377. [2025-11-13 03:45:26,075][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 03:45:26,076][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:45:42,404][__main__][INFO] - Number of regex retries in iteration 377: 0 [2025-11-13 03:45:42,405][__main__][INFO] - agents played in iteration 377 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:45:43,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:45:43,297][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:45:43,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:45:43,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:45:43,343][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:45:43,344][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:45:44,109][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:45:44,572][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:45:45,082][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:45:45,589][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:45:46,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:45:46,606][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:45:47,112][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:45:47,626][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:45:48,129][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:45:48,638][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:45:49,138][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:45:49,640][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:45:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:45:50,666][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:45:51,168][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:45:51,674][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:45:52,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:45:52,681][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:45:53,181][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:45:53,683][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:45:54,184][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:45:54,687][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:45:55,188][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:45:55,691][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:45:56,196][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:45:56,701][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:45:57,200][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:45:57,700][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:45:58,200][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:45:58,701][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:45:59,203][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:45:59,703][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:46:00,202][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:46:00,704][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:46:01,207][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:46:01,711][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:46:02,216][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:46:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:46:03,222][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:46:03,722][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:46:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:46:04,721][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:46:05,221][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:46:05,734][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:46:06,233][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:46:06,733][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:46:07,236][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:46:07,735][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:46:08,242][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:46:08,749][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:46:09,247][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:46:09,759][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:46:10,262][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:46:10,768][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:46:11,276][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:46:11,784][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:46:12,284][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:46:12,788][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:46:13,290][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:46:13,793][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:46:14,302][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:46:14,806][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:46:15,312][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:46:15,817][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:46:16,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9987 tokens. [2025-11-13 03:46:17,070][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.12%, ΔTime: 00:00:32 [2025-11-13 03:46:17,851][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:46:17,852][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:46:17,854][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:46:18,801][__main__][INFO] - Iteration 378 took 52s (30.97% Gen, 67.23% Train). Generation: 16s, Training: 35s. Estimated remaining time: 38h 19m 41s. Estimated total time: 43h 56m 18s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 52s, 500 more iterations: 7h 19m 23s. [2025-11-13 03:46:18,803][__main__][INFO] - Starting iteration 378. [2025-11-13 03:46:19,276][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 03:46:19,277][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:46:35,360][__main__][INFO] - Number of regex retries in iteration 378: 0 [2025-11-13 03:46:35,361][__main__][INFO] - agents played in iteration 378 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:46:36,240][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:46:36,269][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:46:36,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:46:36,319][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:46:36,320][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:46:36,321][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:46:37,048][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:46:37,523][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:46:38,031][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:46:38,537][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:46:39,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:46:39,554][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:46:40,066][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:46:40,568][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:46:41,070][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:46:41,584][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:46:42,093][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:46:42,597][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:46:43,099][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:46:43,600][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:46:44,105][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:46:44,610][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:46:45,112][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:46:45,615][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:46:46,120][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:46:46,621][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:46:47,122][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:46:47,623][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:46:48,126][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:46:48,626][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:46:49,129][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:46:49,646][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:46:50,146][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:46:50,664][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:46:51,166][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:46:51,667][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:46:52,171][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:46:52,674][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:46:53,179][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:46:53,680][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:46:54,184][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:46:54,691][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:46:55,193][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:46:55,695][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:46:56,198][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:46:56,702][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:46:57,206][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:46:57,714][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:46:58,225][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:46:58,731][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:46:59,231][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:46:59,734][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:47:00,249][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:47:00,749][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:47:01,262][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:47:01,762][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:47:02,261][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:47:02,777][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:47:03,281][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:47:03,786][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:47:04,292][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:47:04,798][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:47:05,301][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:47:05,805][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:47:06,314][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:47:06,818][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:47:07,326][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:47:07,831][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:47:08,335][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:47:08,840][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:47:09,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10166 tokens. [2025-11-13 03:47:10,113][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 03:47:10,884][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:47:10,885][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:47:10,887][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:47:11,857][__main__][INFO] - Iteration 379 took 52s (30.59% Gen, 67.56% Train). Generation: 16s, Training: 35s. Estimated remaining time: 38h 11m 33s. Estimated total time: 43h 49m 4s. Time estimates for 10 more iterations: 8m 45s, 100 more iterations: 1h 27m 38s, 500 more iterations: 7h 18m 10s. [2025-11-13 03:47:11,859][__main__][INFO] - Starting iteration 379. [2025-11-13 03:47:12,344][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 03:47:12,345][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:47:19,107][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:47:29,721][__main__][INFO] - Number of regex retries in iteration 379: 1 [2025-11-13 03:47:29,722][__main__][INFO] - agents played in iteration 379 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:47:30,559][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:47:30,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:47:30,614][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:47:30,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:47:30,638][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:47:30,639][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:47:31,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:47:31,882][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:47:32,389][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:47:32,897][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:47:33,399][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:47:33,903][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:47:34,407][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:47:34,913][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:47:35,415][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:47:35,916][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:47:36,418][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:47:36,920][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:47:37,422][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:47:37,924][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:47:38,426][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:47:38,924][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:47:39,423][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:47:39,925][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:47:40,423][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:47:40,924][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:47:41,422][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:47:41,920][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:47:42,420][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:47:42,919][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:47:43,424][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:47:43,928][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:47:44,427][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:47:44,930][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:47:45,432][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:47:45,939][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:47:46,443][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:47:46,943][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:47:47,457][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:47:47,957][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:47:48,459][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:47:48,973][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:47:49,474][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:47:49,985][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:47:50,485][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:47:50,987][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:47:51,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:47:51,999][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:47:52,502][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:47:53,004][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:47:53,504][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:47:54,003][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:47:54,500][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:47:54,999][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:47:55,501][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:47:56,004][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:47:56,508][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:47:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:47:57,519][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:47:58,025][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:47:58,531][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:47:59,038][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:47:59,543][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:48:00,045][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:48:00,551][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:48:01,058][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:48:01,561][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:48:02,067][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:48:02,571][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:48:03,075][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:48:03,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10198 tokens. [2025-11-13 03:48:04,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 03:48:05,215][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:48:05,217][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:48:05,219][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:48:06,334][__main__][INFO] - Iteration 380 took 53s (32.19% Gen, 65.75% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 21m 4s. Estimated total time: 44h 59m 28s. Time estimates for 10 more iterations: 8m 59s, 100 more iterations: 1h 29m 58s, 500 more iterations: 7h 29m 54s. [2025-11-13 03:48:06,336][__main__][INFO] - Starting iteration 380. [2025-11-13 03:48:06,809][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 37 and human policies 1. 
[2025-11-13 03:48:06,810][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:48:24,619][__main__][INFO] - Number of regex retries in iteration 380: 0 [2025-11-13 03:48:24,620][__main__][INFO] - agents played in iteration 380 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:48:25,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:48:25,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:48:25,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:48:25,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:48:25,574][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:48:25,574][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:48:26,337][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:48:26,800][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:48:27,329][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:48:27,832][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:48:28,343][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:48:28,846][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:48:29,349][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:48:29,854][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:48:30,359][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:48:30,868][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:48:31,373][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:48:31,878][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:48:32,389][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:48:32,889][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:48:33,392][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:48:33,901][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:48:34,405][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:48:34,907][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:48:35,409][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:48:35,916][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:48:36,421][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:48:36,927][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:48:37,429][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:48:37,934][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:48:38,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:48:38,944][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:48:39,446][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:48:39,953][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:48:40,456][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:48:40,958][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:48:41,463][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:48:41,963][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:48:42,465][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:48:42,973][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:48:43,477][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:48:43,980][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:48:44,491][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:48:44,994][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:48:45,507][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:48:46,009][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:48:46,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:48:47,024][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:48:47,528][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:48:48,028][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:48:48,528][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:48:49,031][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:48:49,536][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:48:50,037][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:48:50,537][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:48:51,039][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:48:51,539][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:48:52,040][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:48:52,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:48:53,045][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:48:53,550][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:48:54,056][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:48:54,560][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:48:55,063][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:48:55,566][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:48:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:48:56,585][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:48:57,088][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:48:57,598][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:48:58,104][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:48:58,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10142 tokens. [2025-11-13 03:48:59,420][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.10%, ΔTime: 00:00:33 [2025-11-13 03:49:00,227][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:49:00,229][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:49:00,231][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:49:01,960][__main__][INFO] - Iteration 381 took 55s (32.29% Gen, 64.57% Train). Generation: 17s, Training: 35s. Estimated remaining time: 40h 18m 13s. Estimated total time: 45h 57m 33s. Time estimates for 10 more iterations: 9m 11s, 100 more iterations: 1h 31m 55s, 500 more iterations: 7h 39m 35s. [2025-11-13 03:49:01,962][__main__][INFO] - Starting iteration 381. [2025-11-13 03:49:02,441][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 03:49:02,441][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:49:33,794][__main__][INFO] - Number of regex retries in iteration 381: 0 [2025-11-13 03:49:33,795][__main__][INFO] - agents played in iteration 381 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:49:34,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:49:34,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:49:34,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:49:34,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:49:34,727][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:49:34,728][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:49:35,499][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:49:36,100][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:49:36,610][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:49:37,119][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:49:37,624][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:49:38,132][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:49:38,638][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:49:39,141][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:49:39,644][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:49:40,150][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:49:40,662][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:49:41,170][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:49:41,675][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:49:42,193][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:49:42,699][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:49:43,216][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:49:43,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:49:44,230][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:49:44,741][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:49:45,245][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:49:45,755][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:49:46,263][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:49:46,772][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:49:47,279][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:49:47,787][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:49:48,292][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:49:48,798][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:49:49,303][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:49:49,822][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:49:50,323][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:49:50,828][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:49:51,336][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:49:51,840][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:49:52,346][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:49:52,849][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:49:53,353][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:49:53,856][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:49:54,362][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:49:54,867][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:49:55,371][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:49:55,875][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:49:56,389][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:49:56,895][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:49:57,401][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:49:57,913][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:49:58,420][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:49:58,932][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:49:59,439][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:49:59,945][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:50:00,455][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:50:00,963][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:50:01,468][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:50:01,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:50:02,485][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:50:02,996][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:50:03,505][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:50:04,014][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:50:04,521][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:50:05,024][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:50:05,541][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:50:06,047][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:50:06,553][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:50:07,060][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:50:07,562][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:50:08,068][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10178 tokens. [2025-11-13 03:50:08,888][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.00%, Current % of VRAM taken: 58.25%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 03:50:09,549][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:50:09,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:50:09,553][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:50:10,341][__main__][INFO] - Iteration 382 took 1m 7s (46.18% Gen, 52.66% Train). Generation: 31s, Training: 35s. Estimated remaining time: 50h 54m 33s. Estimated total time: 56h 35m 2s. Time estimates for 10 more iterations: 11m 19s, 100 more iterations: 1h 53m 10s, 500 more iterations: 9h 25m 50s. [2025-11-13 03:50:10,343][__main__][INFO] - Starting iteration 382. [2025-11-13 03:50:10,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 03:50:10,836][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:50:24,156][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:50:36,402][__main__][INFO] - Number of regex retries in iteration 382: 1 [2025-11-13 03:50:36,402][__main__][INFO] - agents played in iteration 382 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:50:37,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:50:37,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:50:37,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:50:37,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:50:37,331][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:50:37,331][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:50:38,076][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:50:38,533][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:50:39,038][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:50:39,541][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:50:40,041][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:50:40,549][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:50:41,055][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:50:41,560][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:50:42,064][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:50:42,567][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:50:43,069][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:50:43,577][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:50:44,080][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:50:44,583][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:50:45,086][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:50:45,594][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:50:46,099][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:50:46,601][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:50:47,108][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:50:47,620][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:50:48,132][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:50:48,650][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:50:49,157][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:50:49,663][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:50:50,166][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:50:50,669][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:50:51,177][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:50:51,685][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:50:52,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:50:52,698][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:50:53,204][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:50:53,704][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:50:54,216][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:50:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:50:55,252][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:50:55,864][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:50:56,372][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:50:56,875][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:50:57,380][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:50:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:50:58,390][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:50:58,892][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:50:59,404][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:50:59,908][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:51:00,426][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:51:00,937][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:51:01,442][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:51:01,949][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:51:02,453][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:51:02,958][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:51:03,463][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:51:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:51:04,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:51:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:51:05,502][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:51:06,011][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:51:06,522][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:51:07,042][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:51:07,546][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:51:08,054][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:51:08,557][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:51:09,062][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:51:09,573][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:51:10,080][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:51:10,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10148 tokens. [2025-11-13 03:51:11,466][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 03:51:12,222][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:51:12,224][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:51:12,226][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:51:13,092][__main__][INFO] - Iteration 383 took 1m 2s (41.06% Gen, 57.54% Train). Generation: 25s, Training: 35s. Estimated remaining time: 46h 11m 19s. Estimated total time: 51h 52m 50s. Time estimates for 10 more iterations: 10m 22s, 100 more iterations: 1h 43m 45s, 500 more iterations: 8h 38m 48s. [2025-11-13 03:51:13,094][__main__][INFO] - Starting iteration 383. [2025-11-13 03:51:13,586][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 03:51:13,587][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:51:35,519][__main__][INFO] - Number of regex retries in iteration 383: 0 [2025-11-13 03:51:35,520][__main__][INFO] - agents played in iteration 383 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:51:36,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:51:36,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:51:36,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:51:36,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:51:36,523][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:51:36,524][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:51:37,227][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:51:37,687][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:51:38,197][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:51:38,699][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:51:39,202][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:51:39,705][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:51:40,217][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:51:40,725][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:51:41,244][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:51:41,751][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:51:42,258][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:51:42,767][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:51:43,268][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:51:43,780][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:51:44,284][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:51:44,793][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:51:45,299][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:51:45,806][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:51:46,313][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:51:46,817][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:51:47,320][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:51:47,843][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:51:48,348][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:51:48,854][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:51:49,366][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:51:49,868][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:51:50,376][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:51:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:51:51,385][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:51:51,892][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:51:52,394][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:51:52,898][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:51:53,400][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:51:53,903][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:51:54,419][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:51:54,922][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:51:55,430][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:51:55,935][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:51:56,438][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:51:56,956][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:51:57,463][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:51:57,968][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:51:58,468][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:51:58,976][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:51:59,483][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:51:59,986][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:52:00,489][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:52:00,995][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:52:01,508][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:52:02,011][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:52:02,519][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:52:03,028][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:52:03,551][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:52:04,056][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:52:04,565][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:52:05,068][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:52:05,578][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:52:06,089][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:52:06,598][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:52:07,109][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:52:07,629][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:52:08,139][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:52:08,644][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:52:09,153][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:52:09,663][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10189 tokens. [2025-11-13 03:52:10,485][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 03:52:11,270][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:52:11,272][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:52:11,273][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:52:12,191][__main__][INFO] - Iteration 384 took 58s (37.42% Gen, 61.01% Train). Generation: 21s, Training: 35s. Estimated remaining time: 43h 7m 45s. Estimated total time: 48h 50m 16s. Time estimates for 10 more iterations: 9m 46s, 100 more iterations: 1h 37m 40s, 500 more iterations: 8h 8m 22s. [2025-11-13 03:52:12,193][__main__][INFO] - Starting iteration 384. [2025-11-13 03:52:12,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 03:52:12,684][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:52:18,444][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:52:29,028][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:52:30,391][__main__][INFO] - Number of regex retries in iteration 384: 2 [2025-11-13 03:52:30,392][__main__][INFO] - agents played in iteration 384 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:52:31,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:52:31,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:52:31,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:52:31,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:52:31,325][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:52:31,326][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:52:32,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:52:32,495][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:52:33,003][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:52:33,506][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:52:34,015][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:52:34,527][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:52:35,047][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:52:35,561][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:52:36,070][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:52:36,579][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:52:37,084][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:52:37,589][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:52:38,092][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:52:38,599][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:52:39,106][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:52:39,608][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:52:40,113][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:52:40,622][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:52:41,123][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:52:41,657][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:52:42,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:52:42,664][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:52:43,168][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:52:43,671][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:52:44,176][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:52:44,679][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:52:45,184][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:52:45,687][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:52:46,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:52:46,690][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:52:47,195][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:52:47,701][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:52:48,203][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:52:48,705][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:52:49,207][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:52:49,714][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:52:50,215][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:52:50,720][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:52:51,226][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:52:51,729][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:52:52,244][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:52:52,744][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:52:53,244][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:52:53,748][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:52:54,251][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:52:54,767][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:52:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:52:55,767][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:52:56,269][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:52:56,770][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:52:57,280][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:52:57,782][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:52:58,285][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:52:58,792][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:52:59,295][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:52:59,802][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:53:00,303][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:53:00,808][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:53:01,312][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:53:01,815][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:53:02,319][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:53:02,822][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:53:03,326][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:53:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:53:04,334][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10154 tokens. [2025-11-13 03:53:05,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 03:53:05,925][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:53:05,927][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:53:05,928][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:53:06,831][__main__][INFO] - Iteration 385 took 54s (32.70% Gen, 65.63% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 24m 0s. Estimated total time: 45h 7m 25s. Time estimates for 10 more iterations: 9m 1s, 100 more iterations: 1h 30m 14s, 500 more iterations: 7h 31m 14s. [2025-11-13 03:53:06,833][__main__][INFO] - Starting iteration 385. [2025-11-13 03:53:07,304][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 03:53:07,305][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:53:26,044][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:53:26,949][__main__][INFO] - Number of regex retries in iteration 385: 1 [2025-11-13 03:53:26,950][__main__][INFO] - agents played in iteration 385 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:53:27,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:53:27,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:53:27,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:53:27,871][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:53:27,871][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:53:27,873][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:53:28,577][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:53:29,038][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:53:29,544][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:53:30,048][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:53:30,554][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:53:31,075][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:53:31,586][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:53:32,096][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:53:32,603][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:53:33,105][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:53:33,608][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:53:34,114][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:53:34,619][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:53:35,127][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:53:35,630][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:53:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:53:36,637][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:53:37,142][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:53:37,645][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:53:38,150][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:53:38,655][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:53:39,160][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:53:39,661][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:53:40,181][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:53:40,681][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:53:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:53:41,691][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:53:42,196][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:53:42,702][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:53:43,202][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:53:43,704][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:53:44,206][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:53:44,714][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:53:45,217][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:53:45,718][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:53:46,218][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:53:46,725][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:53:47,230][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:53:47,735][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:53:48,237][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:53:48,741][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:53:49,254][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:53:49,761][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:53:50,273][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:53:50,783][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:53:51,290][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:53:51,794][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:53:52,305][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:53:52,822][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:53:53,334][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:53:53,842][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:53:54,347][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:53:54,855][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:53:55,363][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:53:55,875][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:53:56,382][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:53:56,892][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:53:57,403][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:53:57,907][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:53:58,411][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:53:58,916][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:53:59,422][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:53:59,927][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:54:00,433][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:54:00,940][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10225 tokens. [2025-11-13 03:54:01,754][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 03:54:02,547][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:54:02,549][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:54:02,551][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:54:03,524][__main__][INFO] - Iteration 386 took 56s (34.94% Gen, 63.33% Train). Generation: 19s, Training: 35s. Estimated remaining time: 41h 6m 38s. Estimated total time: 46h 51m 0s. Time estimates for 10 more iterations: 9m 22s, 100 more iterations: 1h 33m 42s, 500 more iterations: 7h 48m 30s. [2025-11-13 03:54:03,526][__main__][INFO] - Starting iteration 386. [2025-11-13 03:54:04,002][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 03:54:04,002][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:54:12,834][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:54:19,934][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:54:20,796][__main__][INFO] - Number of regex retries in iteration 386: 2 [2025-11-13 03:54:20,796][__main__][INFO] - agents played in iteration 386 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:54:21,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:54:21,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:54:21,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:54:21,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:54:21,742][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:54:21,743][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:54:22,436][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:54:22,898][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:54:23,404][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:54:23,908][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:54:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:54:24,917][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:54:25,420][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:54:25,923][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:54:26,428][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:54:26,930][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:54:27,434][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:54:27,941][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:54:28,443][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:54:28,949][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:54:29,450][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:54:29,951][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:54:30,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:54:30,960][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:54:31,463][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:54:31,969][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:54:32,471][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:54:32,993][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:54:33,495][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:54:33,998][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:54:34,503][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:54:35,011][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:54:35,518][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:54:36,020][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:54:36,526][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:54:37,032][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:54:37,537][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:54:38,042][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:54:38,543][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:54:39,045][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:54:39,549][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:54:40,057][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:54:40,560][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:54:41,074][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:54:41,577][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:54:42,093][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:54:42,595][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:54:43,097][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:54:43,607][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:54:44,120][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:54:44,625][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:54:45,130][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:54:45,638][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:54:46,144][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:54:46,648][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:54:47,152][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:54:47,659][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:54:48,162][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:54:48,693][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:54:49,196][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:54:49,705][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:54:50,206][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:54:50,709][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:54:51,215][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:54:51,723][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:54:52,227][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:54:52,731][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:54:53,238][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:54:53,742][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:54:54,245][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:54:54,749][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10179 tokens. [2025-11-13 03:54:55,531][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 03:54:56,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:54:56,291][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:54:56,293][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:54:57,158][__main__][INFO] - Iteration 387 took 53s (31.59% Gen, 66.78% Train). Generation: 16s, Training: 35s. Estimated remaining time: 38h 32m 36s. Estimated total time: 44h 17m 52s. Time estimates for 10 more iterations: 8m 51s, 100 more iterations: 1h 28m 35s, 500 more iterations: 7h 22m 58s. [2025-11-13 03:54:57,160][__main__][INFO] - Starting iteration 387. [2025-11-13 03:54:57,641][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 03:54:57,642][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:55:03,740][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:55:03,833][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:55:15,797][__main__][INFO] - Number of regex retries in iteration 387: 2 [2025-11-13 03:55:15,798][__main__][INFO] - agents played in iteration 387 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:55:16,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:55:16,679][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:55:16,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:55:16,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:55:16,728][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:55:16,729][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:55:17,432][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:55:17,890][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:55:18,396][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:55:18,903][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:55:19,408][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:55:19,916][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:55:20,419][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:55:20,922][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:55:21,426][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:55:21,933][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:55:22,437][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:55:22,941][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:55:23,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:55:23,948][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:55:24,449][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:55:24,956][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:55:25,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:55:25,965][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:55:26,477][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:55:26,982][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:55:27,483][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:55:27,989][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:55:28,490][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:55:29,002][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:55:29,502][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:55:30,006][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:55:30,519][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:55:31,024][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:55:31,524][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:55:32,025][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:55:32,530][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:55:33,032][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:55:33,536][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:55:34,042][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:55:34,549][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:55:35,055][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:55:35,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:55:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:55:36,569][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:55:37,078][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:55:37,581][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:55:38,084][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:55:38,589][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:55:39,092][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:55:39,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:55:40,118][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:55:40,625][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:55:41,127][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:55:41,626][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:55:42,130][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:55:42,630][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:55:43,129][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:55:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:55:44,142][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:55:44,641][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:55:45,147][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:55:45,655][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:55:46,158][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:55:46,665][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:55:47,168][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:55:47,676][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:55:48,184][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:55:48,703][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:55:49,204][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:55:49,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10211 tokens. [2025-11-13 03:55:50,515][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:33 [2025-11-13 03:55:51,293][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:55:51,295][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:55:51,297][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:55:52,200][__main__][INFO] - Iteration 388 took 54s (33.28% Gen, 65.07% Train). Generation: 18s, Training: 35s. Estimated remaining time: 39h 41m 47s. Estimated total time: 45h 27m 57s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 55s, 500 more iterations: 7h 34m 39s. [2025-11-13 03:55:52,202][__main__][INFO] - Starting iteration 388. [2025-11-13 03:55:52,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 03:55:52,680][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:56:01,946][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:56:09,911][__main__][INFO] - Number of regex retries in iteration 388: 1 [2025-11-13 03:56:09,912][__main__][INFO] - agents played in iteration 388 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:56:10,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:56:10,807][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:56:10,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:56:10,857][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:56:10,857][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:56:10,858][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:56:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:56:12,029][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:56:12,540][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:56:13,046][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:56:13,551][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:56:14,061][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:56:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:56:15,073][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:56:15,577][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:56:16,082][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:56:16,593][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:56:17,099][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:56:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:56:18,116][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:56:18,623][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:56:19,143][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:56:19,648][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:56:20,147][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:56:20,654][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:56:21,155][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:56:21,658][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:56:22,159][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:56:22,659][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:56:23,161][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:56:23,661][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:56:24,162][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:56:24,663][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:56:25,163][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:56:25,673][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:56:26,178][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:56:26,681][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:56:27,186][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:56:27,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:56:28,197][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:56:28,699][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:56:29,204][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:56:29,723][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:56:30,226][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:56:30,728][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:56:31,236][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:56:31,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:56:32,257][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:56:32,760][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:56:33,265][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:56:33,771][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:56:34,272][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:56:34,776][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:56:35,280][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:56:35,783][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:56:36,287][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:56:36,788][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:56:37,290][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:56:37,796][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:56:38,301][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:56:38,815][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:56:39,318][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:56:39,821][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:56:40,328][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:56:40,831][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:56:41,338][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:56:41,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:56:42,344][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:56:42,846][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:56:43,349][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:56:43,851][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10174 tokens. [2025-11-13 03:56:44,686][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:33 [2025-11-13 03:56:45,473][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:56:45,475][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:56:45,476][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:56:46,394][__main__][INFO] - Iteration 389 took 53s (32.08% Gen, 66.21% Train). Generation: 17s, Training: 35s. Estimated remaining time: 38h 58m 41s. Estimated total time: 44h 45m 46s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 31s, 500 more iterations: 7h 27m 37s. [2025-11-13 03:56:46,396][__main__][INFO] - Starting iteration 389. [2025-11-13 03:56:46,862][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 03:56:46,863][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:57:02,452][__main__][INFO] - Number of regex retries in iteration 389: 0 [2025-11-13 03:57:02,453][__main__][INFO] - agents played in iteration 389 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:57:03,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:57:03,248][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:57:03,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:57:03,295][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:57:03,296][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:57:03,297][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:57:03,995][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:57:04,454][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:57:04,967][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:57:05,471][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:57:05,974][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:57:06,474][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:57:06,981][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:57:07,483][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:57:07,984][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:57:08,515][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:57:09,022][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:57:09,526][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:57:10,035][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:57:10,541][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:57:11,051][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:57:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:57:12,074][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:57:12,577][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:57:13,082][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:57:13,586][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:57:14,091][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:57:14,597][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:57:15,101][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:57:15,607][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:57:16,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:57:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:57:17,123][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:57:17,628][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:57:18,149][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:57:18,654][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:57:19,162][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:57:19,666][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:57:20,174][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:57:20,680][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:57:21,184][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:57:21,689][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:57:22,197][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:57:22,700][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:57:23,208][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:57:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:57:24,222][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:57:24,742][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:57:25,248][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:57:25,760][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:57:26,268][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:57:26,781][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:57:27,292][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:57:27,799][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:57:28,305][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:57:28,816][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:57:29,324][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:57:29,833][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:57:30,341][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:57:30,847][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:57:31,353][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:57:31,859][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:57:32,368][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:57:32,870][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:57:33,373][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:57:33,880][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:57:34,385][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:57:34,894][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:57:35,399][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:57:35,905][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:57:36,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10148 tokens. [2025-11-13 03:57:37,225][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:33 [2025-11-13 03:57:38,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:57:38,021][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:57:38,023][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:57:39,016][__main__][INFO] - Iteration 390 took 52s (29.89% Gen, 68.20% Train). Generation: 15s, Training: 35s. Estimated remaining time: 37h 39m 45s. Estimated total time: 43h 27m 42s. Time estimates for 10 more iterations: 8m 41s, 100 more iterations: 1h 26m 55s, 500 more iterations: 7h 14m 37s. [2025-11-13 03:57:39,018][__main__][INFO] - Starting iteration 390. [2025-11-13 03:57:39,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 38 and human policies 1. 
[2025-11-13 03:57:39,500][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:57:53,811][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:57:56,043][__main__][INFO] - Number of regex retries in iteration 390: 1 [2025-11-13 03:57:56,044][__main__][INFO] - agents played in iteration 390 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:57:56,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:57:56,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:57:56,891][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:57:56,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:57:56,916][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:57:56,917][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:57:57,612][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:57:58,072][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:57:58,576][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:57:59,076][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:57:59,581][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:58:00,081][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:58:00,582][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:58:01,085][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:58:01,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:58:02,093][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:58:02,607][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:58:03,111][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:58:03,620][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:58:04,123][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:58:04,627][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:58:05,136][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:58:05,639][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:58:06,158][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:58:06,659][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:58:07,167][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:58:07,669][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:58:08,178][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:58:08,680][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:58:09,180][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:58:09,681][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:58:10,183][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:58:10,685][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:58:11,190][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:58:11,695][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:58:12,198][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:58:12,702][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:58:13,206][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:58:13,708][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:58:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:58:14,717][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:58:15,219][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:58:15,721][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:58:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:58:16,742][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:58:17,243][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:58:17,746][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:58:18,251][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:58:18,753][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:58:19,269][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:58:19,774][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:58:20,280][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:58:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:58:21,290][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:58:21,798][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:58:22,302][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:58:22,832][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:58:23,345][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:58:23,851][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:58:24,363][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:58:24,870][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:58:25,378][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:58:25,889][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:58:26,398][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:58:26,901][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:58:27,405][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:58:27,915][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:58:28,419][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:58:28,923][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:58:29,430][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:58:29,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10220 tokens. [2025-11-13 03:58:30,736][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 03:58:31,504][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:58:31,506][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:58:31,507][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:58:33,264][__main__][INFO] - Iteration 391 took 53s (30.77% Gen, 65.96% Train). Generation: 16s, Training: 35s. Estimated remaining time: 38h 59m 21s. Estimated total time: 44h 48m 13s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 36s, 500 more iterations: 7h 28m 2s. [2025-11-13 03:58:33,266][__main__][INFO] - Starting iteration 391. [2025-11-13 03:58:33,749][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 03:58:33,749][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:58:47,935][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 1 y book, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:58:52,485][__main__][INFO] - Number of regex retries in iteration 391: 1 [2025-11-13 03:58:52,485][__main__][INFO] - agents played in iteration 391 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:58:53,301][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:58:53,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:58:53,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:58:53,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:58:53,374][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:58:53,375][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:58:54,079][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:58:54,559][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:58:55,064][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:58:55,566][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:58:56,072][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:58:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:58:57,085][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:58:57,588][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:58:58,096][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:58:58,609][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:58:59,114][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:58:59,619][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:59:00,126][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:59:00,634][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:59:01,140][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:59:01,644][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:59:02,148][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:59:02,654][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:59:03,163][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:59:03,682][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 03:59:04,189][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
03:59:04,692][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 03:59:05,200][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 03:59:05,705][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 03:59:06,219][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 03:59:06,726][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 03:59:07,232][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 03:59:07,765][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 03:59:08,297][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 03:59:08,804][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 03:59:09,314][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 03:59:09,819][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 03:59:10,327][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 03:59:10,831][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 03:59:11,338][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 03:59:11,860][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 03:59:12,371][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 03:59:12,881][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 03:59:13,385][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 03:59:13,898][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 03:59:14,406][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 03:59:14,913][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
03:59:15,421][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 03:59:15,924][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 03:59:16,430][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 03:59:16,937][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 03:59:17,442][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 03:59:17,953][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 03:59:18,463][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 03:59:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 03:59:19,492][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 03:59:20,001][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 03:59:20,511][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 03:59:21,019][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 03:59:21,526][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 03:59:22,039][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 03:59:22,547][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 03:59:23,054][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 03:59:23,562][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 03:59:24,075][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 03:59:24,590][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 03:59:25,097][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 03:59:25,601][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
03:59:26,107][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 03:59:26,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10366 tokens. [2025-11-13 03:59:27,391][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 03:59:28,183][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 03:59:28,185][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 03:59:28,187][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 03:59:29,146][__main__][INFO] - Iteration 392 took 55s (33.82% Gen, 64.44% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 20m 7s. Estimated total time: 46h 9m 55s. Time estimates for 10 more iterations: 9m 13s, 100 more iterations: 1h 32m 19s, 500 more iterations: 7h 41m 39s. [2025-11-13 03:59:29,148][__main__][INFO] - Starting iteration 392. [2025-11-13 03:59:29,638][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 03:59:29,639][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 03:59:43,112][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 03:59:48,598][__main__][INFO] - Number of regex retries in iteration 392: 1 [2025-11-13 03:59:48,599][__main__][INFO] - agents played in iteration 392 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 03:59:49,475][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:59:49,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:59:49,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:59:49,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 03:59:49,550][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 03:59:49,551][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 03:59:50,276][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 03:59:50,762][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 03:59:51,281][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 03:59:51,787][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 03:59:52,290][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 03:59:52,797][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 03:59:53,306][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 03:59:53,807][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 03:59:54,309][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 03:59:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 03:59:55,323][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 03:59:55,826][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 03:59:56,327][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 03:59:56,836][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 03:59:57,347][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 03:59:57,851][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 03:59:58,353][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 03:59:58,860][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 03:59:59,369][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 03:59:59,878][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:00:00,382][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:00:00,888][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:00:01,390][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:00:01,894][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:00:02,398][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:00:02,901][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:00:03,414][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:00:03,917][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:00:04,422][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:00:04,927][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:00:05,434][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:00:05,939][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:00:06,453][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:00:06,960][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:00:07,464][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:00:07,973][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:00:08,479][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:00:08,995][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:00:09,500][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:00:10,005][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:00:10,506][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:00:11,011][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:00:11,516][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:00:12,021][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:00:12,523][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:00:13,025][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:00:13,527][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:00:14,035][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:00:14,543][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:00:15,045][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:00:15,549][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:00:16,059][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:00:16,566][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:00:17,072][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:00:17,578][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:00:18,088][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:00:18,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:00:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:00:19,600][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:00:20,101][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:00:20,605][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:00:21,104][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:00:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:00:22,106][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:00:22,608][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10204 tokens. [2025-11-13 04:00:23,355][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.12%, ΔTime: 00:00:33 [2025-11-13 04:00:24,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:00:24,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:00:24,135][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:00:25,062][__main__][INFO] - Iteration 393 took 55s (34.21% Gen, 64.12% Train). Generation: 18s, Training: 35s. Estimated remaining time: 40h 20m 28s. Estimated total time: 46h 11m 11s. Time estimates for 10 more iterations: 9m 14s, 100 more iterations: 1h 32m 22s, 500 more iterations: 7h 41m 51s. [2025-11-13 04:00:25,064][__main__][INFO] - Starting iteration 393. [2025-11-13 04:00:25,556][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 04:00:25,557][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:00:41,943][__main__][INFO] - Number of regex retries in iteration 393: 0 [2025-11-13 04:00:41,943][__main__][INFO] - agents played in iteration 393 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:00:42,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:00:42,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:00:42,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:00:42,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:00:42,883][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:00:42,884][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:00:43,636][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:00:44,097][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:00:44,613][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:00:45,116][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:00:45,619][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:00:46,124][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:00:46,631][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:00:47,133][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:00:47,639][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:00:48,142][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:00:48,650][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:00:49,152][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:00:49,656][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:00:50,164][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:00:50,667][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:00:51,169][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:00:51,672][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:00:52,178][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:00:52,699][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:00:53,201][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:00:53,705][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:00:54,207][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:00:54,709][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:00:55,213][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:00:55,714][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:00:56,219][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:00:56,726][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:00:57,234][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:00:57,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:00:58,243][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:00:58,747][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:00:59,260][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:00:59,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:01:00,271][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:01:00,771][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:01:01,278][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:01:01,800][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:01:02,303][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:01:02,809][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:01:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:01:03,809][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:01:04,315][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:01:04,816][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:01:05,320][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:01:05,823][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:01:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:01:06,833][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:01:07,332][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:01:07,835][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:01:08,336][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:01:08,835][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:01:09,339][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:01:09,841][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:01:10,342][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:01:10,845][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:01:11,344][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:01:11,845][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:01:12,347][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:01:12,847][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:01:13,351][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:01:13,858][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:01:14,363][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:01:14,866][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:01:15,368][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:01:15,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10102 tokens. [2025-11-13 04:01:16,635][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.04%, Current % of VRAM taken: 58.28%, Block Peak % of device VRAM: 62.47%, ΔTime: 00:00:33 [2025-11-13 04:01:17,389][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:01:17,391][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:01:17,393][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:01:18,355][__main__][INFO] - Iteration 394 took 52s (31.04% Gen, 67.14% Train). Generation: 16s, Training: 35s. Estimated remaining time: 38h 8m 20s. Estimated total time: 43h 59m 57s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 59s, 500 more iterations: 7h 19m 59s. [2025-11-13 04:01:18,357][__main__][INFO] - Starting iteration 394. [2025-11-13 04:01:18,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 04:01:18,835][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:01:35,272][__main__][INFO] - Number of regex retries in iteration 394: 0 [2025-11-13 04:01:35,273][__main__][INFO] - agents played in iteration 394 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:01:36,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:01:36,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:01:36,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:01:36,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:01:36,214][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:01:36,215][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:01:37,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:01:37,486][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:01:37,997][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:01:38,507][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:01:39,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:01:39,517][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:01:40,022][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:01:40,530][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:01:41,034][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:01:41,538][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:01:42,053][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:01:42,556][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:01:43,059][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:01:43,587][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:01:44,089][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:01:44,595][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:01:45,100][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:01:45,607][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:01:46,118][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:01:46,625][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:01:47,126][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:01:47,631][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:01:48,131][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:01:48,638][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:01:49,139][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:01:49,647][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:01:50,156][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:01:50,658][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:01:51,161][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:01:51,666][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:01:52,193][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:01:52,718][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:01:53,223][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:01:53,724][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:01:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:01:54,733][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:01:55,238][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:01:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:01:56,242][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:01:56,749][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:01:57,252][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:01:57,754][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:01:58,255][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:01:58,755][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:01:59,255][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:01:59,755][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:02:00,254][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:02:00,756][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:02:01,259][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:02:01,758][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:02:02,258][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:02:02,757][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:02:03,261][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:02:03,766][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:02:04,272][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:02:04,773][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:02:05,277][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:02:05,781][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:02:06,285][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:02:06,793][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:02:07,297][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:02:07,802][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:02:08,304][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:02:08,807][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:02:09,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10139 tokens. [2025-11-13 04:02:10,092][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 04:02:10,890][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:02:10,891][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:02:10,894][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:02:11,795][__main__][INFO] - Iteration 395 took 52s (31.04% Gen, 67.26% Train). Generation: 16s, Training: 35s. Estimated remaining time: 38h 15m 33s. Estimated total time: 44h 8m 3s. Time estimates for 10 more iterations: 8m 49s, 100 more iterations: 1h 28m 16s, 500 more iterations: 7h 21m 20s. [2025-11-13 04:02:11,797][__main__][INFO] - Starting iteration 395. [2025-11-13 04:02:12,285][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 04:02:12,285][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:02:16,677][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:02:27,812][__main__][INFO] - Number of regex retries in iteration 395: 1 [2025-11-13 04:02:27,813][__main__][INFO] - agents played in iteration 395 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:02:28,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:02:28,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:02:28,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:02:28,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:02:28,670][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:02:28,671][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:02:29,376][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:02:29,838][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:02:30,348][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:02:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:02:31,359][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:02:31,870][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:02:32,374][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:02:32,879][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:02:33,385][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:02:33,888][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:02:34,395][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:02:34,904][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:02:35,411][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:02:35,923][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:02:36,429][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:02:36,945][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:02:37,453][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:02:37,956][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:02:38,466][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:02:38,968][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:02:39,475][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:02:39,980][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:02:40,481][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:02:40,984][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:02:41,485][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:02:41,992][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:02:42,501][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:02:43,004][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:02:43,507][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:02:44,009][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:02:44,511][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:02:45,031][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:02:45,533][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:02:46,039][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:02:46,540][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:02:47,045][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:02:47,554][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:02:48,063][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:02:48,572][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:02:49,082][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:02:49,590][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:02:50,093][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:02:50,598][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:02:51,105][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:02:51,610][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:02:52,112][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:02:52,614][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:02:53,122][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:02:53,623][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:02:54,138][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:02:54,638][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:02:55,143][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:02:55,645][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:02:56,144][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:02:56,647][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:02:57,154][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:02:57,664][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:02:58,178][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:02:58,684][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:02:59,190][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:02:59,700][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:03:00,207][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:03:00,726][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:03:01,232][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:03:01,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10223 tokens. [2025-11-13 04:03:02,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 04:03:03,315][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:03:03,316][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:03:03,318][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:03:04,235][__main__][INFO] - Iteration 396 took 51s (29.89% Gen, 68.34% Train). Generation: 15s, Training: 35s. Estimated remaining time: 37h 24m 8s. Estimated total time: 43h 17m 31s. Time estimates for 10 more iterations: 8m 39s, 100 more iterations: 1h 26m 35s, 500 more iterations: 7h 12m 55s. [2025-11-13 04:03:04,237][__main__][INFO] - Starting iteration 396. [2025-11-13 04:03:04,708][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 04:03:04,709][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:03:09,033][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:03:21,095][__main__][INFO] - Number of regex retries in iteration 396: 1 [2025-11-13 04:03:21,095][__main__][INFO] - agents played in iteration 396 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:03:21,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:03:21,909][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:03:21,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:03:21,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:03:21,959][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:03:21,959][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:03:22,698][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:03:23,163][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:03:23,672][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:03:24,182][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:03:24,695][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:03:25,201][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:03:25,706][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:03:26,221][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:03:26,728][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:03:27,249][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:03:27,752][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:03:28,261][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:03:28,766][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:03:29,273][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:03:29,783][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:03:30,289][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:03:30,791][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:03:31,300][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:03:31,801][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:03:32,303][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:03:32,811][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:03:33,320][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:03:33,822][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:03:34,326][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:03:34,831][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:03:35,337][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:03:35,840][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:03:36,353][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:03:36,859][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:03:37,364][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:03:37,872][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:03:38,377][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:03:38,881][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:03:39,383][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:03:39,884][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:03:40,388][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:03:40,895][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:03:41,399][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:03:41,904][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:03:42,404][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:03:42,915][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:03:43,420][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:03:43,925][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:03:44,436][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:03:44,939][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:03:45,443][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:03:45,946][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:03:46,450][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:03:46,960][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:03:47,462][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:03:47,962][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:03:48,465][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:03:48,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:03:49,473][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:03:49,983][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:03:50,483][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:03:50,991][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:03:51,492][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:03:52,003][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:03:52,509][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:03:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:03:53,546][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:03:54,048][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:03:54,554][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:03:55,055][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10206 tokens. [2025-11-13 04:03:55,855][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 04:03:56,625][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:03:56,627][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:03:56,628][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:03:57,469][__main__][INFO] - Iteration 397 took 52s (31.06% Gen, 67.35% Train). Generation: 16s, Training: 35s. Estimated remaining time: 38h 3m 48s. Estimated total time: 43h 58m 4s. Time estimates for 10 more iterations: 8m 47s, 100 more iterations: 1h 27m 56s, 500 more iterations: 7h 19m 40s. [2025-11-13 04:03:57,471][__main__][INFO] - Starting iteration 397. [2025-11-13 04:03:57,960][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 04:03:57,960][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:04:15,163][__main__][INFO] - Number of regex retries in iteration 397: 0 [2025-11-13 04:04:15,164][__main__][INFO] - agents played in iteration 397 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:04:15,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:04:16,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:04:16,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:04:16,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:04:16,053][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:04:16,054][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:04:16,834][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:04:17,298][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:04:17,811][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:04:18,330][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:04:18,840][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:04:19,348][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:04:19,860][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:04:20,363][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:04:20,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:04:21,380][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:04:21,883][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:04:22,387][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:04:22,889][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:04:23,396][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:04:23,901][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:04:24,402][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:04:24,903][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:04:25,407][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:04:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:04:26,424][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:04:26,924][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:04:27,431][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:04:27,932][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:04:28,436][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:04:28,958][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:04:29,462][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:04:29,970][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:04:30,471][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:04:30,973][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:04:31,481][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:04:31,985][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:04:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:04:32,992][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:04:33,494][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:04:33,995][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:04:34,497][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:04:35,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:04:35,515][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:04:36,015][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:04:36,521][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:04:37,029][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:04:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:04:38,045][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:04:38,548][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:04:39,053][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:04:39,558][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:04:40,058][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:04:40,561][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:04:41,060][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:04:41,570][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:04:42,076][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:04:42,582][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:04:43,090][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:04:43,597][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:04:44,104][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:04:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:04:45,145][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:04:45,653][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:04:46,163][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:04:46,666][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:04:47,170][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:04:47,677][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:04:48,184][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:04:48,688][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:04:49,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10263 tokens. [2025-11-13 04:04:49,981][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:33 [2025-11-13 04:04:50,754][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:04:50,756][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:04:50,757][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:04:51,711][__main__][INFO] - Iteration 398 took 53s (32.00% Gen, 66.22% Train). Generation: 17s, Training: 35s. Estimated remaining time: 38h 52m 25s. Estimated total time: 44h 47m 35s. Time estimates for 10 more iterations: 8m 57s, 100 more iterations: 1h 29m 35s, 500 more iterations: 7h 27m 55s. [2025-11-13 04:04:51,715][__main__][INFO] - Starting iteration 398. [2025-11-13 04:04:52,200][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 04:04:52,201][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:04:59,331][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:05:11,970][__main__][INFO] - Number of regex retries in iteration 398: 1 [2025-11-13 04:05:11,971][__main__][INFO] - agents played in iteration 398 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:05:12,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:05:12,873][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:05:12,897][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:05:12,920][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:05:12,921][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:05:12,922][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:05:13,728][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:05:14,194][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:05:14,708][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:05:15,212][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:05:15,717][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:05:16,231][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:05:16,733][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:05:17,237][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:05:17,746][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:05:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:05:18,759][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:05:19,262][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:05:19,763][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:05:20,269][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:05:20,771][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:05:21,276][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:05:21,777][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:05:22,281][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:05:22,786][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:05:23,291][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:05:23,796][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:05:24,303][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:05:24,808][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:05:25,312][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:05:25,823][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:05:26,324][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:05:26,843][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:05:27,347][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:05:27,852][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:05:28,354][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:05:28,860][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:05:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:05:29,864][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:05:30,367][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:05:30,875][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:05:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:05:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:05:32,385][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:05:32,885][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:05:33,388][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:05:33,891][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:05:34,391][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:05:34,891][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:05:35,394][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:05:35,895][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:05:36,399][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:05:36,905][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:05:37,413][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:05:37,919][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:05:38,426][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:05:38,940][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:05:39,440][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:05:39,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:05:40,446][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:05:40,949][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:05:41,465][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:05:41,966][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:05:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:05:42,980][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:05:43,487][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:05:43,997][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:05:44,504][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:05:45,008][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:05:45,517][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:05:46,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10169 tokens. [2025-11-13 04:05:46,781][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 04:05:47,561][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:05:47,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:05:47,570][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:05:48,500][__main__][INFO] - Iteration 399 took 56s (35.12% Gen, 63.23% Train). Generation: 19s, Training: 35s. Estimated remaining time: 40h 58m 55s. Estimated total time: 46h 55m 2s. Time estimates for 10 more iterations: 9m 23s, 100 more iterations: 1h 33m 50s, 500 more iterations: 7h 49m 10s. [2025-11-13 04:05:48,502][__main__][INFO] - Starting iteration 399. [2025-11-13 04:05:48,993][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 04:05:48,994][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:06:06,941][__main__][INFO] - Number of regex retries in iteration 399: 0 [2025-11-13 04:06:06,942][__main__][INFO] - agents played in iteration 399 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:06:07,780][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:06:07,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:06:07,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:06:07,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:06:07,848][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:06:07,849][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:06:08,643][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:06:09,103][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:06:09,610][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:06:10,119][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:06:10,621][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:06:11,128][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:06:11,640][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:06:12,145][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:06:12,648][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:06:13,149][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:06:13,650][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:06:14,173][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:06:14,678][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:06:15,184][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:06:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:06:16,191][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:06:16,700][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:06:17,204][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:06:17,711][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:06:18,218][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:06:18,729][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:06:19,238][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:06:19,743][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:06:20,252][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:06:20,762][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:06:21,262][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:06:21,764][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:06:22,267][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:06:22,778][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:06:23,287][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:06:23,799][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:06:24,307][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:06:24,810][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:06:25,313][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:06:25,816][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:06:26,315][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:06:26,816][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:06:27,340][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:06:27,838][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:06:28,338][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:06:28,845][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:06:29,343][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:06:29,853][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:06:30,352][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:06:30,852][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:06:31,355][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:06:31,855][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:06:32,356][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:06:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:06:33,366][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:06:33,878][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:06:34,381][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:06:34,889][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:06:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:06:35,896][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:06:36,400][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:06:36,905][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:06:37,407][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:06:37,914][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:06:38,418][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:06:38,924][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:06:39,440][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:06:39,942][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:06:40,460][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:06:40,963][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10171 tokens. [2025-11-13 04:06:41,745][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:33 [2025-11-13 04:06:42,532][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:06:42,533][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:06:42,535][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:06:43,547][__main__][INFO] - Iteration 400 took 54s (32.90% Gen, 65.24% Train). Generation: 17s, Training: 35s. Estimated remaining time: 39h 30m 42s. Estimated total time: 45h 27m 44s. Time estimates for 10 more iterations: 9m 5s, 100 more iterations: 1h 30m 55s, 500 more iterations: 7h 34m 37s. [2025-11-13 04:06:43,549][__main__][INFO] - Starting iteration 400. [2025-11-13 04:06:44,053][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 39 and human policies 1. 
[2025-11-13 04:06:44,054][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:07:05,488][__main__][INFO] - Number of regex retries in iteration 400: 0 [2025-11-13 04:07:05,488][__main__][INFO] - agents played in iteration 400 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:07:06,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:07:06,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:07:06,370][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:07:06,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:07:06,394][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:07:06,395][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:07:07,169][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:07:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:07:08,168][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:07:08,674][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:07:09,183][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:07:09,693][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:07:10,198][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:07:10,705][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:07:11,208][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:07:11,712][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:07:12,224][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:07:12,726][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:07:13,242][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:07:13,747][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:07:14,250][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:07:14,770][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:07:15,273][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:07:15,778][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:07:16,281][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:07:16,784][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:07:17,286][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:07:17,792][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:07:18,302][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:07:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:07:19,311][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:07:19,814][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:07:20,321][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:07:20,829][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:07:21,338][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:07:21,844][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:07:22,346][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:07:22,865][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:07:23,366][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:07:23,866][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:07:24,365][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:07:24,865][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:07:25,373][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:07:25,873][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:07:26,375][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:07:26,873][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:07:27,373][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:07:27,876][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:07:28,379][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:07:28,882][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:07:29,385][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:07:29,884][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:07:30,388][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:07:30,888][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:07:31,388][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:07:31,899][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:07:32,405][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:07:32,908][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:07:33,413][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:07:33,916][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:07:34,423][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:07:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:07:35,434][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:07:35,939][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:07:36,442][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:07:36,945][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:07:37,451][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:07:37,956][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:07:38,467][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:07:38,971][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:07:39,478][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10135 tokens. [2025-11-13 04:07:40,258][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.42%, ΔTime: 00:00:33 [2025-11-13 04:07:41,044][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:07:41,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:07:41,047][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:07:42,807][__main__][INFO] - Iteration 401 took 58s (36.48% Gen, 60.52% Train). Generation: 21s, Training: 35s. Estimated remaining time: 42h 59m 41s. Estimated total time: 48h 57m 42s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 55s, 500 more iterations: 8h 9m 37s. [2025-11-13 04:07:42,809][__main__][INFO] - Starting iteration 401. [2025-11-13 04:07:43,300][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 04:07:43,301][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:07:53,837][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:08:06,169][__main__][INFO] - Number of regex retries in iteration 401: 1 [2025-11-13 04:08:06,170][__main__][INFO] - agents played in iteration 401 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:08:06,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:08:06,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:08:07,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:08:07,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:08:07,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:08:07,034][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:08:07,790][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:08:08,252][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:08:08,764][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:08:09,272][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:08:09,779][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:08:10,283][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:08:10,786][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:08:11,296][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:08:11,799][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:08:12,309][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:08:12,814][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:08:13,319][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:08:13,823][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:08:14,331][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:08:14,836][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:08:15,342][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:08:15,845][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:08:16,348][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:08:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:08:17,367][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:08:17,879][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:08:18,380][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:08:18,878][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:08:19,396][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:08:19,897][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:08:20,400][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:08:20,902][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:08:21,401][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:08:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:08:22,401][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:08:22,900][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:08:23,404][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:08:23,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:08:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:08:24,909][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:08:25,409][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:08:25,912][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:08:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:08:26,916][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:08:27,416][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:08:27,919][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:08:28,421][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:08:28,921][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:08:29,424][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:08:29,929][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:08:30,428][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:08:30,930][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:08:31,435][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:08:31,941][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:08:32,449][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:08:32,950][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:08:33,452][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:08:33,969][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:08:34,480][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:08:34,992][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:08:35,494][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:08:36,002][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:08:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:08:37,007][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:08:37,509][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:08:38,020][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:08:38,529][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:08:39,034][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:08:39,539][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:08:40,047][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10157 tokens. [2025-11-13 04:08:40,861][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 04:08:41,636][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:08:41,638][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:08:41,640][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:08:42,570][__main__][INFO] - Iteration 402 took 59s (38.58% Gen, 59.85% Train). Generation: 22s, Training: 35s. Estimated remaining time: 43h 24m 31s. Estimated total time: 49h 23m 32s. Time estimates for 10 more iterations: 9m 52s, 100 more iterations: 1h 38m 47s, 500 more iterations: 8h 13m 55s. [2025-11-13 04:08:42,572][__main__][INFO] - Starting iteration 402. [2025-11-13 04:08:43,068][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 04:08:43,068][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:08:48,964][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:08:59,398][__main__][INFO] - Number of regex retries in iteration 402: 1 [2025-11-13 04:08:59,399][__main__][INFO] - agents played in iteration 402 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:09:00,204][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:09:00,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:09:00,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:09:00,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:09:00,282][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:09:00,283][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:09:01,000][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:09:01,479][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:09:01,987][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:09:02,489][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:09:02,991][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:09:03,492][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:09:04,004][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:09:04,512][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:09:05,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:09:05,522][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:09:06,024][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:09:06,527][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:09:07,036][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:09:07,540][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:09:08,045][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:09:08,555][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:09:09,063][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:09:09,569][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:09:10,076][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:09:10,581][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:09:11,085][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:09:11,591][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:09:12,093][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:09:12,592][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:09:13,101][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:09:13,604][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:09:14,104][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:09:14,604][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:09:15,103][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:09:15,602][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:09:16,101][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:09:16,600][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:09:17,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:09:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:09:18,115][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:09:18,613][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:09:19,118][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:09:19,625][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:09:20,126][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:09:20,626][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:09:21,126][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:09:21,626][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:09:22,126][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:09:22,624][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:09:23,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:09:23,627][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:09:24,131][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:09:24,632][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:09:25,137][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:09:25,639][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:09:26,141][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:09:26,645][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:09:27,144][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:09:27,644][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:09:28,146][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:09:28,648][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:09:29,149][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:09:29,652][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:09:30,180][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:09:30,685][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:09:31,194][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:09:31,698][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:09:32,206][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:09:32,713][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:09:33,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10105 tokens. [2025-11-13 04:09:34,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:33 [2025-11-13 04:09:34,793][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:09:34,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:09:34,796][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:09:35,726][__main__][INFO] - Iteration 403 took 52s (31.01% Gen, 67.22% Train). Generation: 16s, Training: 35s. Estimated remaining time: 37h 53m 3s. Estimated total time: 43h 52m 57s. Time estimates for 10 more iterations: 8m 46s, 100 more iterations: 1h 27m 45s, 500 more iterations: 7h 18m 49s. [2025-11-13 04:09:35,728][__main__][INFO] - Starting iteration 403. [2025-11-13 04:09:36,202][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 04:09:36,202][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:09:56,979][__main__][INFO] - Number of regex retries in iteration 403: 0 [2025-11-13 04:09:56,979][__main__][INFO] - agents played in iteration 403 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:09:57,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:09:57,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:09:57,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:09:57,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:09:57,925][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:09:57,926][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:09:58,671][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:09:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:09:59,639][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:10:00,144][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:10:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:10:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:10:01,664][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:10:02,165][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:10:02,668][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:10:03,183][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:10:03,683][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:10:04,208][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:10:04,709][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:10:05,215][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:10:05,718][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:10:06,224][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:10:06,729][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:10:07,229][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:10:07,732][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:10:08,234][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:10:08,738][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:10:09,241][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:10:09,769][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:10:10,273][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:10:10,779][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:10:11,280][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:10:11,781][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:10:12,290][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:10:12,789][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:10:13,293][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:10:13,803][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:10:14,304][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:10:14,812][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:10:15,311][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:10:15,811][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:10:16,314][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:10:16,815][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:10:17,321][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:10:17,824][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:10:18,322][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:10:18,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:10:19,338][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:10:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:10:20,344][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:10:20,846][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:10:21,350][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:10:21,852][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:10:22,355][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:10:22,861][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:10:23,363][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:10:23,862][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:10:24,363][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:10:24,866][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:10:25,369][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:10:25,872][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:10:26,377][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:10:26,894][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:10:27,400][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:10:27,911][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:10:28,415][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:10:28,921][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:10:29,428][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:10:29,931][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:10:30,435][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:10:30,943][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10205 tokens. [2025-11-13 04:10:31,727][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 04:10:32,488][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:10:32,490][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:10:32,491][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:10:33,631][__main__][INFO] - Iteration 404 took 57s (36.18% Gen, 61.83% Train). Generation: 20s, Training: 35s. Estimated remaining time: 41h 50m 38s. Estimated total time: 47h 51m 30s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 43s, 500 more iterations: 7h 58m 35s. [2025-11-13 04:10:33,633][__main__][INFO] - Starting iteration 404. [2025-11-13 04:10:34,115][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 04:10:34,116][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:10:50,416][__main__][INFO] - Number of regex retries in iteration 404: 0 [2025-11-13 04:10:50,416][__main__][INFO] - agents played in iteration 404 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:10:51,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:10:51,225][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:10:51,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:10:51,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:10:51,275][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:10:51,276][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:10:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:10:52,483][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:10:52,992][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:10:53,499][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:10:54,001][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:10:54,518][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:10:55,020][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:10:55,522][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:10:56,024][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:10:56,532][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:10:57,042][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:10:57,545][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:10:58,053][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:10:58,557][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:10:59,063][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:10:59,573][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:11:00,075][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:11:00,578][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:11:01,113][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:11:01,617][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:11:02,141][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:11:02,640][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:11:03,142][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:11:03,644][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:11:04,144][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:11:04,644][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:11:05,144][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:11:05,644][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:11:06,149][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:11:06,649][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:11:07,150][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:11:07,651][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:11:08,151][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:11:08,650][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:11:09,156][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:11:09,660][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:11:10,163][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:11:10,665][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:11:11,164][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:11:11,664][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:11:12,162][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:11:12,668][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:11:13,173][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:11:13,676][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:11:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:11:14,693][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:11:15,196][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:11:15,706][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:11:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:11:16,715][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:11:17,219][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:11:17,721][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:11:18,221][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:11:18,727][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:11:19,229][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:11:19,736][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:11:20,240][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:11:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:11:21,249][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:11:21,752][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:11:22,261][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:11:22,766][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:11:23,275][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:11:23,780][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:11:24,290][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10202 tokens. [2025-11-13 04:11:25,079][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 04:11:25,845][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:11:25,848][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:11:25,850][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:11:26,917][__main__][INFO] - Iteration 405 took 52s (30.87% Gen, 67.11% Train). Generation: 16s, Training: 35s. Estimated remaining time: 37h 58m 20s. Estimated total time: 44h 0m 5s. Time estimates for 10 more iterations: 8m 48s, 100 more iterations: 1h 28m 0s, 500 more iterations: 7h 20m 0s. [2025-11-13 04:11:26,919][__main__][INFO] - Starting iteration 405. [2025-11-13 04:11:27,388][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 04:11:27,389][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:11:43,731][__main__][INFO] - Number of regex retries in iteration 405: 0 [2025-11-13 04:11:43,732][__main__][INFO] - agents played in iteration 405 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:11:44,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:11:44,608][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:11:44,632][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:11:44,654][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:11:44,655][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:11:44,656][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:11:45,406][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:11:45,867][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:11:46,380][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:11:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:11:47,401][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:11:47,906][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:11:48,418][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:11:48,922][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:11:49,445][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:11:49,949][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:11:50,458][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:11:50,963][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:11:51,468][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:11:51,972][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:11:52,481][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:11:52,986][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:11:53,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:11:54,004][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:11:54,510][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:11:55,013][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:11:55,516][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:11:56,020][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:11:56,524][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:11:57,032][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:11:57,534][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:11:58,044][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:11:58,556][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:11:59,063][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:11:59,571][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:12:00,077][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:12:00,578][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:12:01,084][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:12:01,589][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:12:02,092][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:12:02,599][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:12:03,101][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:12:03,609][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:12:04,109][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:12:04,612][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:12:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:12:05,618][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:12:06,120][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:12:06,626][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:12:07,132][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:12:07,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:12:08,149][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:12:08,654][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:12:09,156][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:12:09,659][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:12:10,165][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:12:10,666][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:12:11,167][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:12:11,673][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:12:12,184][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:12:12,691][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:12:13,199][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:12:13,706][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:12:14,210][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:12:14,713][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:12:15,221][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:12:15,725][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:12:16,232][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:12:16,747][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:12:17,254][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:12:17,765][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10213 tokens. [2025-11-13 04:12:18,573][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 04:12:19,346][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:12:19,348][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:12:19,350][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:12:20,391][__main__][INFO] - Iteration 406 took 53s (30.83% Gen, 67.20% Train). Generation: 16s, Training: 35s. Estimated remaining time: 38h 7m 29s. Estimated total time: 44h 10m 8s. Time estimates for 10 more iterations: 8m 50s, 100 more iterations: 1h 28m 20s, 500 more iterations: 7h 21m 41s. [2025-11-13 04:12:20,393][__main__][INFO] - Starting iteration 406. [2025-11-13 04:12:20,870][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 04:12:20,871][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:12:43,832][__main__][INFO] - Number of regex retries in iteration 406: 0 [2025-11-13 04:12:43,832][__main__][INFO] - agents played in iteration 406 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:12:44,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:12:44,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:12:44,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:12:44,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:12:44,772][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:12:44,773][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:12:45,494][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:12:45,955][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:12:46,460][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:12:46,959][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:12:47,463][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:12:47,965][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:12:48,467][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:12:48,968][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:12:49,471][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:12:49,979][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:12:50,480][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:12:50,983][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:12:51,483][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:12:51,986][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:12:52,487][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:12:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:12:53,489][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:12:53,996][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:12:54,501][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:12:55,008][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:12:55,510][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:12:56,009][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:12:56,521][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:12:57,021][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:12:57,521][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:12:58,029][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:12:58,528][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:12:59,040][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:12:59,539][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:13:00,037][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:13:00,550][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:13:01,048][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:13:01,546][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:13:02,046][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:13:02,546][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:13:03,057][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:13:03,559][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:13:04,059][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:13:04,560][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:13:05,060][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:13:05,565][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:13:06,063][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:13:06,567][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:13:07,067][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:13:07,570][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:13:08,072][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:13:08,572][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:13:09,072][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:13:09,579][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:13:10,078][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:13:10,578][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:13:11,077][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:13:11,581][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:13:12,087][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:13:12,589][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:13:13,090][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:13:13,590][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:13:14,093][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:13:14,596][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:13:15,102][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:13:15,602][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:13:16,105][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:13:16,609][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:13:17,114][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:13:17,619][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10140 tokens. [2025-11-13 04:13:18,425][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.04%, Current % of VRAM taken: 58.29%, Block Peak % of device VRAM: 62.01%, ΔTime: 00:00:32 [2025-11-13 04:13:19,194][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:13:19,196][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:13:19,199][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:13:20,187][__main__][INFO] - Iteration 407 took 59s (38.71% Gen, 59.62% Train). Generation: 22s, Training: 35s. Estimated remaining time: 43h 22m 13s. Estimated total time: 49h 25m 51s. Time estimates for 10 more iterations: 9m 53s, 100 more iterations: 1h 38m 51s, 500 more iterations: 8h 14m 18s. [2025-11-13 04:13:20,189][__main__][INFO] - Starting iteration 407. [2025-11-13 04:13:20,678][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 04:13:20,678][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:13:38,693][__main__][INFO] - Number of regex retries in iteration 407: 0 [2025-11-13 04:13:38,694][__main__][INFO] - agents played in iteration 407 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:13:39,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:13:39,518][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:13:39,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:13:39,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:13:39,577][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:13:39,579][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:13:40,308][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:13:40,764][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:13:41,279][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:13:41,780][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:13:42,282][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:13:42,788][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:13:43,289][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:13:43,792][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:13:44,293][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:13:44,792][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:13:45,295][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:13:45,796][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:13:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:13:46,803][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:13:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:13:47,805][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:13:48,308][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:13:48,811][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:13:49,311][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:13:49,811][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:13:50,315][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:13:50,818][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:13:51,321][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:13:51,839][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:13:52,341][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:13:52,860][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:13:53,361][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:13:53,882][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:13:54,387][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:13:54,887][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:13:55,387][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:13:55,888][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:13:56,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:13:56,892][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:13:57,392][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:13:57,891][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:13:58,391][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:13:58,889][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:13:59,397][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:13:59,899][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:14:00,402][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:14:00,903][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:14:01,404][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:14:01,906][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:14:02,406][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:14:02,907][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:14:03,412][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:14:03,919][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:14:04,420][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:14:04,929][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:14:05,430][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:14:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:14:06,444][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:14:06,945][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:14:07,455][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:14:07,959][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:14:08,467][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:14:08,972][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:14:09,481][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:14:09,988][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:14:10,494][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:14:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:14:11,505][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:14:12,007][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:14:12,511][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10121 tokens. [2025-11-13 04:14:13,332][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:33 [2025-11-13 04:14:14,079][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:14:14,081][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:14:14,082][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:14:15,177][__main__][INFO] - Iteration 408 took 54s (33.05% Gen, 64.93% Train). Generation: 18s, Training: 35s. Estimated remaining time: 39h 20m 26s. Estimated total time: 45h 24m 59s. Time estimates for 10 more iterations: 9m 4s, 100 more iterations: 1h 30m 49s, 500 more iterations: 7h 34m 9s. [2025-11-13 04:14:15,179][__main__][INFO] - Starting iteration 408. [2025-11-13 04:14:15,663][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 04:14:15,665][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:14:37,231][__main__][INFO] - Number of regex retries in iteration 408: 0 [2025-11-13 04:14:37,231][__main__][INFO] - agents played in iteration 408 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:14:38,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:14:38,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:14:38,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:14:38,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:14:38,152][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:14:38,153][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:14:38,947][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:14:39,410][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:14:39,922][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:14:40,424][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:14:40,926][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:14:41,427][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:14:41,931][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:14:42,439][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:14:42,949][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:14:43,450][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:14:43,952][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:14:44,451][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:14:44,973][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:14:45,474][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:14:45,975][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:14:46,482][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:14:46,984][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:14:47,492][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:14:47,998][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:14:48,529][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:14:49,047][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:14:49,551][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:14:50,055][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:14:50,560][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:14:51,065][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:14:51,579][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:14:52,081][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:14:52,587][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:14:53,090][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:14:53,597][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:14:54,099][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:14:54,608][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:14:55,109][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:14:55,609][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:14:56,109][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:14:56,610][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:14:57,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:14:57,623][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:14:58,126][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:14:58,633][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:14:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:14:59,641][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:15:00,144][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:15:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:15:01,152][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:15:01,656][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:15:02,164][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:15:02,669][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:15:03,170][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:15:03,672][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:15:04,174][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:15:04,676][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:15:05,184][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:15:05,690][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:15:06,199][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:15:06,709][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:15:07,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:15:07,728][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:15:08,236][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:15:08,745][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:15:09,251][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:15:09,757][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:15:10,279][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:15:10,784][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:15:11,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10169 tokens. [2025-11-13 04:15:12,131][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 04:15:12,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:15:12,919][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:15:12,921][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:15:13,911][__main__][INFO] - Iteration 409 took 58s (37.02% Gen, 61.27% Train). Generation: 21s, Training: 35s. Estimated remaining time: 42h 26m 54s. Estimated total time: 48h 32m 26s. Time estimates for 10 more iterations: 9m 42s, 100 more iterations: 1h 37m 4s, 500 more iterations: 8h 5m 24s. [2025-11-13 04:15:13,914][__main__][INFO] - Starting iteration 409. [2025-11-13 04:15:14,399][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 04:15:14,400][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:15:38,468][__main__][INFO] - Number of regex retries in iteration 409: 0 [2025-11-13 04:15:38,469][__main__][INFO] - agents played in iteration 409 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:15:39,303][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:15:39,327][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:15:39,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:15:39,374][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:15:39,375][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:15:39,375][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:15:40,130][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:15:40,598][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:15:41,109][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:15:41,620][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:15:42,132][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:15:42,639][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:15:43,146][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:15:43,657][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:15:44,163][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:15:44,666][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:15:45,172][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:15:45,696][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:15:46,199][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:15:46,702][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:15:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:15:47,710][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:15:48,219][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:15:48,724][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:15:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:15:49,732][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:15:50,234][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:15:50,741][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:15:51,245][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:15:51,751][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:15:52,254][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:15:52,760][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:15:53,264][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:15:53,775][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:15:54,280][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:15:54,792][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:15:55,295][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:15:55,801][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:15:56,311][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:15:56,817][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:15:57,320][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:15:57,825][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:15:58,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:15:58,839][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:15:59,346][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:15:59,858][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:16:00,377][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:16:00,881][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:16:01,384][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:16:01,889][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:16:02,391][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:16:02,902][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:16:03,404][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:16:03,904][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:16:04,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:16:04,918][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:16:05,419][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:16:05,931][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:16:06,437][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:16:06,944][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:16:07,453][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:16:07,958][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:16:08,474][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:16:08,982][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:16:09,503][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:16:10,007][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:16:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:16:11,025][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:16:11,527][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:16:12,037][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:16:12,545][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10193 tokens. [2025-11-13 04:16:13,376][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 04:16:14,102][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:16:14,103][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:16:14,105][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:16:15,122][__main__][INFO] - Iteration 410 took 1m 0s (39.63% Gen, 58.69% Train). Generation: 24s, Training: 35s. Estimated remaining time: 44h 29m 36s. Estimated total time: 50h 36m 9s. Time estimates for 10 more iterations: 10m 7s, 100 more iterations: 1h 41m 12s, 500 more iterations: 8h 26m 1s. [2025-11-13 04:16:15,124][__main__][INFO] - Starting iteration 410. [2025-11-13 04:16:15,595][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 40 and human policies 1. 
[2025-11-13 04:16:15,596][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:16:37,514][__main__][INFO] - Number of regex retries in iteration 410: 0 [2025-11-13 04:16:37,515][__main__][INFO] - agents played in iteration 410 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:16:38,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:16:38,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:16:38,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:16:38,479][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:16:38,480][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:16:38,480][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:16:39,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:16:39,730][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:16:40,237][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:16:40,742][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:16:41,247][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:16:41,756][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:16:42,267][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:16:42,772][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:16:43,277][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:16:43,782][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:16:44,284][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:16:44,790][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:16:45,301][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:16:45,804][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:16:46,315][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:16:46,818][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:16:47,321][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:16:47,826][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:16:48,338][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:16:48,844][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:16:49,347][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:16:49,849][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:16:50,354][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:16:50,855][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:16:51,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:16:51,866][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:16:52,369][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:16:52,876][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:16:53,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:16:53,884][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:16:54,401][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:16:54,905][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:16:55,418][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:16:55,922][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:16:56,426][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:16:56,939][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:16:57,442][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:16:57,946][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:16:58,450][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:16:58,960][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:16:59,467][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:16:59,969][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:17:00,475][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:17:00,991][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:17:01,493][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:17:02,009][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:17:02,510][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:17:03,016][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:17:03,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:17:04,024][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:17:04,530][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:17:05,032][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:17:05,536][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:17:06,045][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:17:06,554][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:17:07,064][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:17:07,574][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:17:08,081][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:17:08,626][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:17:09,142][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:17:09,648][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:17:10,161][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:17:10,669][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:17:11,174][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:17:11,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10212 tokens. [2025-11-13 04:17:12,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 04:17:13,258][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:17:13,261][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:17:13,263][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:17:15,182][__main__][INFO] - Iteration 411 took 59s (36.78% Gen, 59.99% Train). Generation: 21s, Training: 35s. Estimated remaining time: 43h 31m 49s. Estimated total time: 49h 39m 22s. Time estimates for 10 more iterations: 9m 55s, 100 more iterations: 1h 39m 18s, 500 more iterations: 8h 16m 33s. [2025-11-13 04:17:15,185][__main__][INFO] - Starting iteration 411. [2025-11-13 04:17:15,662][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 04:17:15,663][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:17:45,644][__main__][INFO] - Number of regex retries in iteration 411: 0 [2025-11-13 04:17:45,645][__main__][INFO] - agents played in iteration 411 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:17:46,505][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:17:46,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:17:46,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:17:46,592][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:17:46,592][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:17:46,594][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:17:47,379][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:17:47,841][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:17:48,352][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:17:48,863][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:17:49,369][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:17:49,873][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:17:50,378][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:17:50,884][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:17:51,396][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:17:51,902][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:17:52,406][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:17:52,912][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:17:53,433][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:17:53,936][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:17:54,439][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:17:54,952][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:17:55,463][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:17:55,971][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:17:56,484][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:17:56,991][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:17:57,511][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:17:58,019][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:17:58,527][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:17:59,035][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:17:59,543][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:18:00,051][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:18:00,560][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:18:01,067][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:18:01,570][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:18:02,070][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:18:02,573][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:18:03,075][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:18:03,578][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:18:04,086][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:18:04,589][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:18:05,096][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:18:05,606][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:18:06,109][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:18:06,622][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:18:07,132][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:18:07,639][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:18:08,144][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:18:08,647][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:18:09,152][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:18:09,661][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:18:10,170][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:18:10,677][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:18:11,185][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:18:11,692][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:18:12,198][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:18:12,702][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:18:13,225][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:18:13,735][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:18:14,241][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:18:14,751][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:18:15,259][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:18:15,769][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:18:16,277][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:18:16,788][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:18:17,318][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:18:17,825][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:18:18,332][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:18:18,844][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:18:19,350][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:18:19,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10172 tokens. [2025-11-13 04:18:20,693][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:33 [2025-11-13 04:18:21,459][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:18:21,463][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:18:21,464][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:18:22,521][__main__][INFO] - Iteration 412 took 1m 6s (44.84% Gen, 53.57% Train). Generation: 29s, Training: 35s. Estimated remaining time: 49h 34m 18s. Estimated total time: 55h 42m 59s. Time estimates for 10 more iterations: 11m 8s, 100 more iterations: 1h 51m 25s, 500 more iterations: 9h 17m 9s. [2025-11-13 04:18:22,523][__main__][INFO] - Starting iteration 412. [2025-11-13 04:18:23,006][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 04:18:23,007][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:18:45,290][__main__][INFO] - Number of regex retries in iteration 412: 0 [2025-11-13 04:18:45,291][__main__][INFO] - agents played in iteration 412 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:18:46,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:18:46,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:18:46,128][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:18:46,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:18:46,151][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:18:46,153][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:18:46,845][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:18:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:18:47,821][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:18:48,333][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:18:48,844][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:18:49,351][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:18:50,217][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:18:51,852][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:18:52,383][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:18:52,889][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:18:53,394][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:18:53,902][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:18:54,419][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:18:54,924][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:18:55,442][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:18:55,950][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:18:56,457][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:18:56,962][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:18:57,469][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:18:57,974][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:18:58,481][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:18:58,986][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:18:59,497][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:19:00,003][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:19:00,508][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:19:01,011][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:19:01,519][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:19:02,024][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:19:02,529][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:19:03,035][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:19:03,543][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:19:04,047][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:19:04,556][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:19:05,065][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:19:05,574][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:19:06,082][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:19:06,590][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:19:07,092][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:19:07,596][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:19:08,103][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:19:08,608][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:19:09,115][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:19:09,621][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:19:10,132][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:19:10,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:19:11,154][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:19:11,662][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:19:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:19:12,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:19:13,187][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:19:13,697][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:19:14,204][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:19:14,712][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:19:15,232][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:19:15,747][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:19:16,255][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:19:16,762][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:19:17,269][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:19:17,776][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:19:18,283][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:19:18,788][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:19:19,293][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:19:19,799][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:19:20,308][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:19:20,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10225 tokens. [2025-11-13 04:19:21,674][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:34 [2025-11-13 04:19:22,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:19:22,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:19:22,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:19:23,231][__main__][INFO] - Iteration 413 took 1m 0s (37.00% Gen, 61.49% Train). Generation: 22s, Training: 37s. Estimated remaining time: 44h 1m 33s. Estimated total time: 50h 11m 15s. Time estimates for 10 more iterations: 10m 2s, 100 more iterations: 1h 40m 22s, 500 more iterations: 8h 21m 52s. [2025-11-13 04:19:23,233][__main__][INFO] - Starting iteration 413. [2025-11-13 04:19:23,717][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 04:19:23,718][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:19:47,760][__main__][INFO] - Number of regex retries in iteration 413: 0 [2025-11-13 04:19:47,761][__main__][INFO] - agents played in iteration 413 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:19:48,645][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:19:48,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:19:48,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:19:48,722][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:19:48,723][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:19:48,724][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:19:49,511][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:19:49,972][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:19:50,486][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:19:50,985][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:19:51,485][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:19:51,991][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:19:52,490][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:19:52,994][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:19:53,501][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:19:54,000][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:19:54,508][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:19:55,010][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:19:55,515][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:19:56,014][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:19:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:19:57,025][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:19:57,523][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:19:58,024][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:19:58,527][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:19:59,028][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:19:59,546][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:20:00,046][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:20:00,548][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:20:01,049][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:20:01,551][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:20:02,051][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:20:02,549][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:20:03,049][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:20:03,564][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:20:04,069][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:20:04,574][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:20:05,077][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:20:05,580][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:20:06,085][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:20:06,593][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:20:07,103][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:20:07,609][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:20:08,112][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:20:08,614][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:20:09,119][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:20:09,623][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:20:10,149][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:20:10,652][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:20:11,163][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:20:11,670][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:20:12,176][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:20:12,680][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:20:13,188][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:20:13,692][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:20:14,199][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:20:14,705][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:20:15,208][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:20:15,715][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:20:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:20:16,729][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:20:17,233][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:20:17,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:20:18,241][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:20:18,744][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:20:19,254][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:20:19,764][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:20:20,269][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:20:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:20:21,285][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:20:21,790][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10113 tokens. [2025-11-13 04:20:22,653][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.06%, ΔTime: 00:00:33 [2025-11-13 04:20:23,410][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:20:23,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:20:23,415][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:20:24,432][__main__][INFO] - Iteration 414 took 1m 0s (39.60% Gen, 58.72% Train). Generation: 24s, Training: 35s. Estimated remaining time: 44h 25m 4s. Estimated total time: 50h 35m 47s. Time estimates for 10 more iterations: 10m 7s, 100 more iterations: 1h 41m 11s, 500 more iterations: 8h 25m 57s. [2025-11-13 04:20:24,434][__main__][INFO] - Starting iteration 414. [2025-11-13 04:20:24,913][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 04:20:24,913][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:20:45,187][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:20:54,505][__main__][INFO] - Number of regex retries in iteration 414: 1 [2025-11-13 04:20:54,506][__main__][INFO] - agents played in iteration 414 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:20:55,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:20:55,397][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:20:55,424][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:20:55,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:20:55,449][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:20:55,450][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:20:56,165][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:20:56,639][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:20:57,148][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:20:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:20:58,163][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:20:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:20:59,180][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:20:59,686][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:21:00,192][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:21:00,704][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:21:01,210][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:21:01,718][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:21:02,222][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:21:02,725][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:21:03,242][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:21:03,751][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:21:04,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:21:04,770][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:21:05,277][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:21:05,783][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:21:06,289][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:21:06,794][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:21:07,300][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:21:07,805][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:21:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:21:08,819][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:21:09,345][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:21:09,860][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:21:10,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:21:10,884][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:21:11,390][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:21:11,896][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:21:12,406][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:21:12,910][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:21:13,417][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:21:13,924][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:21:14,429][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:21:14,949][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:21:15,453][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:21:15,965][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:21:16,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:21:16,974][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:21:17,482][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:21:17,987][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:21:18,494][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:21:19,002][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:21:19,505][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:21:20,012][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:21:20,520][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:21:21,027][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:21:21,542][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:21:22,049][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:21:22,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:21:23,068][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:21:23,580][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:21:24,090][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:21:24,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:21:25,106][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:21:25,617][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:21:26,124][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:21:26,633][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:21:27,140][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:21:27,649][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:21:28,156][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:21:28,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10265 tokens. [2025-11-13 04:21:29,499][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 04:21:30,130][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:21:30,132][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:21:30,133][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:21:31,100][__main__][INFO] - Iteration 415 took 1m 6s (44.71% Gen, 53.83% Train). Generation: 29s, Training: 35s. Estimated remaining time: 48h 57m 35s. Estimated total time: 55h 9m 25s. Time estimates for 10 more iterations: 11m 1s, 100 more iterations: 1h 50m 18s, 500 more iterations: 9h 11m 34s. [2025-11-13 04:21:31,103][__main__][INFO] - Starting iteration 415. [2025-11-13 04:21:31,587][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 04:21:31,588][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:21:48,979][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:21:49,978][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:22:00,672][__main__][INFO] - Number of regex retries in iteration 415: 2 [2025-11-13 04:22:00,673][__main__][INFO] - agents played in iteration 415 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:22:01,527][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:22:01,554][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:22:01,581][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:22:01,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:22:01,604][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:22:01,605][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:22:02,327][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:22:02,787][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:22:03,292][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:22:03,800][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:22:04,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:22:04,807][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:22:05,308][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:22:05,808][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:22:06,309][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:22:06,809][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:22:07,311][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:22:07,812][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:22:08,315][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:22:08,820][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:22:09,325][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:22:09,832][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:22:10,333][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:22:10,833][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:22:11,337][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:22:11,837][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:22:12,341][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:22:12,857][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:22:13,363][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:22:13,864][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:22:14,370][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:22:14,883][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:22:15,394][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:22:15,901][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:22:16,408][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:22:16,915][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:22:17,423][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:22:17,930][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:22:18,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:22:18,948][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:22:19,450][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:22:19,956][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:22:20,461][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:22:20,969][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:22:21,471][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:22:21,991][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:22:22,495][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:22:23,004][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:22:23,507][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:22:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:22:24,521][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:22:25,027][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:22:25,531][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:22:26,034][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:22:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:22:27,036][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:22:27,551][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:22:28,053][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:22:28,566][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:22:29,076][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:22:29,581][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:22:30,092][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:22:30,600][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:22:31,110][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:22:31,616][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:22:32,126][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:22:32,633][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:22:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:22:33,645][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:22:34,163][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:22:34,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10138 tokens. [2025-11-13 04:22:35,553][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.37%, ΔTime: 00:00:33 [2025-11-13 04:22:36,307][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:22:36,309][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:22:36,311][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:22:37,372][__main__][INFO] - Iteration 416 took 1m 5s (44.21% Gen, 54.17% Train). Generation: 29s, Training: 35s. Estimated remaining time: 48h 36m 23s. Estimated total time: 54h 49m 18s. Time estimates for 10 more iterations: 10m 57s, 100 more iterations: 1h 49m 38s, 500 more iterations: 9h 8m 13s. [2025-11-13 04:22:37,375][__main__][INFO] - Starting iteration 416. [2025-11-13 04:22:37,882][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 04:22:37,883][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:23:01,163][__main__][INFO] - Number of regex retries in iteration 416: 0 [2025-11-13 04:23:01,164][__main__][INFO] - agents played in iteration 416 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:23:01,944][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:23:01,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:23:01,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:23:02,016][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:23:02,017][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:23:02,017][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:23:02,728][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:23:03,193][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:23:03,706][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:23:04,212][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:23:04,718][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:23:05,224][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:23:05,733][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:23:06,236][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:23:06,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:23:07,248][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:23:07,749][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:23:08,252][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:23:08,755][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:23:09,269][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:23:09,771][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:23:10,276][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:23:10,793][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:23:11,297][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:23:11,805][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:23:12,309][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:23:12,817][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:23:13,326][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:23:13,831][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:23:14,336][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:23:14,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:23:15,351][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:23:15,854][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:23:16,356][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:23:18,203][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:23:18,795][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:23:19,302][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:23:19,818][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:23:20,324][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:23:20,830][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:23:21,334][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:23:21,834][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:23:22,337][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:23:22,841][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:23:23,344][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:23:23,853][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:23:24,355][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:23:24,860][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:23:25,369][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:23:25,873][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:23:26,386][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:23:26,887][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:23:27,391][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:23:27,894][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:23:28,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:23:28,907][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:23:29,415][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:23:29,927][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:23:30,437][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:23:30,943][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:23:31,448][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:23:31,959][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:23:32,465][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:23:32,983][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:23:33,488][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:23:33,995][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:23:34,505][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:23:35,010][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:23:35,515][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:23:36,020][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:23:36,525][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10201 tokens. [2025-11-13 04:23:37,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:34 [2025-11-13 04:23:38,139][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:23:38,141][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:23:38,143][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:23:39,070][__main__][INFO] - Iteration 417 took 1m 1s (38.05% Gen, 60.43% Train). Generation: 23s, Training: 36s. Estimated remaining time: 44h 45m 29s. Estimated total time: 50h 59m 26s. Time estimates for 10 more iterations: 10m 11s, 100 more iterations: 1h 41m 58s, 500 more iterations: 8h 29m 54s. [2025-11-13 04:23:39,073][__main__][INFO] - Starting iteration 417. [2025-11-13 04:23:39,562][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 04:23:39,562][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:24:06,598][__main__][INFO] - Number of regex retries in iteration 417: 0 [2025-11-13 04:24:06,599][__main__][INFO] - agents played in iteration 417 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:24:07,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:24:07,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:24:07,593][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:24:07,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:24:07,617][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:24:07,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:24:08,328][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:24:08,787][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:24:09,296][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:24:09,806][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:24:10,313][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:24:10,818][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:24:11,321][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:24:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:24:12,330][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:24:12,833][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:24:13,337][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:24:13,848][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:24:14,351][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:24:14,856][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:24:15,362][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:24:15,869][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:24:16,390][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:24:16,897][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:24:17,405][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:24:17,908][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:24:18,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:24:18,924][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:24:19,427][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:24:19,930][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:24:20,436][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:24:20,941][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:24:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:24:21,947][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:24:22,449][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:24:22,952][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:24:23,459][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:24:23,963][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:24:24,475][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:24:24,979][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:24:25,482][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:24:25,987][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:24:26,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:24:27,004][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:24:27,507][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:24:28,010][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:24:28,516][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:24:29,023][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:24:29,529][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:24:30,038][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:24:30,544][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:24:31,051][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:24:31,553][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:24:32,059][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:24:32,562][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:24:33,064][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:24:33,585][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:24:34,089][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:24:34,591][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:24:35,103][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:24:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:24:36,113][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:24:36,621][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:24:37,129][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:24:37,638][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:24:38,144][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:24:38,650][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:24:39,153][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:24:39,662][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:24:40,184][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:24:40,690][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10114 tokens. [2025-11-13 04:24:41,538][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 04:24:42,277][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:24:42,279][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:24:42,281][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:24:43,235][__main__][INFO] - Iteration 418 took 1m 3s (42.46% Gen, 56.04% Train). Generation: 27s, Training: 35s. Estimated remaining time: 46h 48m 40s. Estimated total time: 53h 3m 42s. Time estimates for 10 more iterations: 10m 36s, 100 more iterations: 1h 46m 7s, 500 more iterations: 8h 50m 37s. [2025-11-13 04:24:43,237][__main__][INFO] - Starting iteration 418. [2025-11-13 04:24:43,729][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 04:24:43,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:25:06,881][__main__][INFO] - Number of regex retries in iteration 418: 0 [2025-11-13 04:25:06,881][__main__][INFO] - agents played in iteration 418 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:25:07,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:25:07,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:25:07,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:25:07,784][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:25:07,785][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:25:07,785][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:25:08,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:25:08,936][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:25:09,444][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:25:09,951][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:25:10,453][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:25:10,951][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:25:11,452][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:25:11,961][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:25:12,469][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:25:12,980][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:25:13,482][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:25:13,986][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:25:14,505][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:25:15,007][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:25:15,508][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:25:16,009][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:25:16,509][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:25:17,013][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:25:17,513][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:25:18,013][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:25:18,513][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:25:19,014][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:25:19,514][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:25:20,018][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:25:20,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:25:21,026][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:25:21,531][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:25:22,039][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:25:22,544][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:25:23,052][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:25:23,556][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:25:24,061][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:25:24,566][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:25:25,087][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:25:25,589][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:25:26,096][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:25:26,599][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:25:27,106][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:25:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:25:28,124][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:25:28,626][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:25:29,130][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:25:29,633][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:25:30,136][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:25:30,639][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:25:31,142][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:25:31,646][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:25:32,150][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:25:32,651][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:25:33,160][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:25:33,669][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:25:34,184][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:25:34,688][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:25:35,205][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:25:35,708][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:25:36,214][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:25:36,720][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:25:37,224][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:25:37,728][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:25:38,233][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:25:38,744][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:25:39,248][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:25:41,303][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:25:41,817][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:25:42,325][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10162 tokens. [2025-11-13 04:25:43,198][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:34 [2025-11-13 04:25:43,849][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:25:43,851][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:25:43,854][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:25:44,772][__main__][INFO] - Iteration 419 took 1m 1s (37.92% Gen, 60.57% Train). Generation: 23s, Training: 36s. Estimated remaining time: 44h 36m 10s. Estimated total time: 50h 52m 13s. Time estimates for 10 more iterations: 10m 10s, 100 more iterations: 1h 41m 44s, 500 more iterations: 8h 28m 42s. [2025-11-13 04:25:44,775][__main__][INFO] - Starting iteration 419. [2025-11-13 04:25:45,262][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 04:25:45,262][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:26:15,954][__main__][INFO] - Number of regex retries in iteration 419: 0 [2025-11-13 04:26:15,955][__main__][INFO] - agents played in iteration 419 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:26:16,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:26:16,836][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:26:16,858][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:26:16,880][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:26:16,881][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:26:16,882][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:26:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:26:18,066][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:26:18,580][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:26:19,085][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:26:19,585][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:26:20,109][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:26:20,609][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:26:21,108][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:26:21,608][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:26:22,108][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:26:22,613][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:26:23,116][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:26:23,622][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:26:24,125][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:26:24,627][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:26:25,130][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:26:25,630][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:26:26,138][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:26:26,644][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:26:27,147][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:26:27,651][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:26:28,153][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:26:28,660][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:26:29,165][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:26:29,676][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:26:30,180][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:26:30,688][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:26:31,197][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:26:31,704][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:26:32,208][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:26:32,711][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:26:33,218][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:26:33,726][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:26:34,231][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:26:34,736][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:26:35,239][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:26:35,748][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:26:36,254][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:26:36,762][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:26:37,268][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:26:37,777][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:26:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:26:38,801][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:26:39,306][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:26:39,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:26:40,320][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:26:40,825][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:26:41,337][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:26:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:26:42,350][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:26:42,855][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:26:43,364][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:26:43,868][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:26:44,379][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:26:44,883][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:26:45,387][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:26:45,894][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:26:46,415][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:26:46,919][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:26:47,425][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:26:47,930][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:26:48,437][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:26:48,944][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:26:49,449][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:26:49,953][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10143 tokens. [2025-11-13 04:26:50,807][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.15%, ΔTime: 00:00:33 [2025-11-13 04:26:51,595][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:26:51,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:26:51,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:26:52,579][__main__][INFO] - Iteration 420 took 1m 7s (45.59% Gen, 52.95% Train). Generation: 30s, Training: 35s. Estimated remaining time: 49h 48m 42s. Estimated total time: 56h 5m 53s. Time estimates for 10 more iterations: 11m 13s, 100 more iterations: 1h 52m 11s, 500 more iterations: 9h 20m 58s. [2025-11-13 04:26:52,581][__main__][INFO] - Starting iteration 420. [2025-11-13 04:26:53,070][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 41 and human policies 1. 
[2025-11-13 04:26:53,070][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:27:18,021][__main__][INFO] - Number of regex retries in iteration 420: 0 [2025-11-13 04:27:18,021][__main__][INFO] - agents played in iteration 420 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:27:18,794][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:27:18,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:27:18,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:27:18,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:27:18,866][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:27:18,867][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:27:19,572][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:27:20,029][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:27:20,540][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:27:21,040][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:27:21,541][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:27:22,045][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:27:22,544][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:27:23,044][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:27:23,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:27:24,047][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:27:24,554][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:27:25,057][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:27:25,558][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:27:26,062][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:27:26,564][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:27:27,067][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:27:27,570][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:27:28,070][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:27:28,568][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:27:29,068][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:27:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:27:30,071][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:27:30,568][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:27:31,068][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:27:31,569][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:27:32,073][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:27:32,585][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:27:33,091][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:27:33,595][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:27:34,099][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:27:34,606][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:27:35,112][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:27:35,616][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:27:36,120][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:27:36,631][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:27:37,134][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:27:37,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:27:38,147][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:27:38,658][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:27:39,165][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:27:39,669][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:27:40,176][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:27:40,680][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:27:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:27:41,695][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:27:42,205][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:27:42,709][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:27:43,227][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:27:43,735][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:27:44,241][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:27:44,747][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:27:45,253][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:27:45,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:27:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:27:46,773][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:27:49,175][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:27:49,733][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:27:50,237][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:27:50,742][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:27:51,249][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:27:51,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:27:52,258][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:27:52,763][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:27:53,270][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:27:53,776][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10095 tokens. [2025-11-13 04:27:54,658][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:35 [2025-11-13 04:27:55,315][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:27:55,317][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:27:55,319][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:27:57,180][__main__][INFO] - Iteration 421 took 1m 4s (38.92% Gen, 58.18% Train). Generation: 24s, Training: 37s. Estimated remaining time: 47h 7m 17s. Estimated total time: 53h 25m 33s. Time estimates for 10 more iterations: 10m 41s, 100 more iterations: 1h 46m 51s, 500 more iterations: 8h 54m 15s. [2025-11-13 04:27:57,184][__main__][INFO] - Starting iteration 421. [2025-11-13 04:27:57,667][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 04:27:57,668][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:28:20,326][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:28:24,735][__main__][INFO] - Number of regex retries in iteration 421: 1 [2025-11-13 04:28:24,736][__main__][INFO] - agents played in iteration 421 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:28:25,730][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:28:25,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:28:25,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:28:25,810][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:28:25,811][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:28:25,811][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:28:26,532][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:28:26,994][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:28:27,513][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:28:28,020][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:28:28,522][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:28:29,037][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:28:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:28:30,043][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:28:30,549][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:28:31,055][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:28:31,554][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:28:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:28:32,553][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:28:33,059][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:28:33,563][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:28:34,063][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:28:34,563][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:28:35,066][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:28:35,575][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:28:36,078][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:28:36,583][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:28:37,097][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:28:37,597][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:28:38,107][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:28:38,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:28:39,118][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:28:39,636][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:28:40,146][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:28:40,656][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:28:41,159][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:28:41,663][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:28:42,173][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:28:42,676][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:28:43,190][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:28:43,698][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:28:44,206][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:28:44,734][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:28:45,243][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:28:45,752][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:28:46,259][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:28:46,769][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:28:47,274][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:28:47,782][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:28:48,289][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:28:48,811][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:28:49,323][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:28:49,834][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:28:50,345][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:28:50,854][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:28:51,366][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:28:51,873][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:28:52,383][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:28:52,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:28:53,398][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:28:53,924][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:28:54,434][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:28:54,946][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:28:55,453][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:28:55,958][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:28:56,465][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:28:56,972][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:28:57,478][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:28:57,987][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:28:58,493][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:28:59,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10094 tokens. [2025-11-13 04:28:59,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 04:29:00,626][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:29:00,630][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:29:00,632][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:29:01,730][__main__][INFO] - Iteration 422 took 1m 4s (42.25% Gen, 56.03% Train). Generation: 27s, Training: 35s. Estimated remaining time: 47h 3m 52s. Estimated total time: 53h 23m 12s. Time estimates for 10 more iterations: 10m 40s, 100 more iterations: 1h 46m 46s, 500 more iterations: 8h 53m 52s. [2025-11-13 04:29:01,732][__main__][INFO] - Starting iteration 422. [2025-11-13 04:29:02,203][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 04:29:02,204][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:29:27,098][__main__][INFO] - Number of regex retries in iteration 422: 0 [2025-11-13 04:29:27,099][__main__][INFO] - agents played in iteration 422 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:29:27,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:29:27,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:29:27,958][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:29:27,981][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:29:27,981][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:29:27,982][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:29:28,689][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:29:29,151][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:29:29,654][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:29:30,157][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:29:30,668][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:29:31,168][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:29:31,675][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:29:32,175][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:29:32,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:29:33,179][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:29:33,683][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:29:34,186][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:29:34,699][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:29:35,200][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:29:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:29:36,207][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:29:36,707][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:29:37,212][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:29:37,734][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:29:38,236][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:29:38,739][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:29:39,241][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:29:39,763][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:29:40,267][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:29:40,770][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:29:41,280][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:29:41,785][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:29:42,290][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:29:42,794][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:29:43,300][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:29:43,824][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:29:44,326][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:29:44,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:29:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:29:45,847][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:29:46,355][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:29:46,862][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:29:47,372][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:29:47,887][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:29:48,392][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:29:48,902][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:29:49,409][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:29:49,911][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:29:50,436][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:29:50,942][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:29:51,448][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:29:51,954][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:29:52,462][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:29:52,971][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:29:53,475][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:29:53,983][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:29:54,491][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:29:54,997][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:29:57,255][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:29:57,779][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:29:58,296][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:29:58,803][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:29:59,307][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:29:59,825][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:30:00,329][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:30:00,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:30:01,347][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:30:01,852][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:30:02,361][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:30:02,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10190 tokens. [2025-11-13 04:30:03,761][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:35 [2025-11-13 04:30:04,420][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:30:04,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:30:04,423][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:30:05,327][__main__][INFO] - Iteration 423 took 1m 3s (39.44% Gen, 59.13% Train). Generation: 24s, Training: 37s. Estimated remaining time: 46h 15m 50s. Estimated total time: 52h 36m 13s. Time estimates for 10 more iterations: 10m 31s, 100 more iterations: 1h 45m 12s, 500 more iterations: 8h 46m 2s. [2025-11-13 04:30:05,330][__main__][INFO] - Starting iteration 423. [2025-11-13 04:30:05,814][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 04:30:05,815][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:30:41,414][__main__][INFO] - Number of regex retries in iteration 423: 0 [2025-11-13 04:30:41,414][__main__][INFO] - agents played in iteration 423 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:30:42,268][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:30:42,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:30:42,323][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:30:42,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:30:42,346][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:30:42,347][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:30:43,091][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:30:43,556][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:30:44,062][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:30:44,571][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:30:45,074][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:30:45,591][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:30:46,100][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:30:46,606][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:30:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:30:47,624][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:30:48,133][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:30:48,642][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:30:49,151][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:30:49,663][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:30:50,168][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:30:50,673][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:30:51,184][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:30:51,695][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:30:52,199][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:30:52,706][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:30:53,215][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:30:53,726][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:30:54,235][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:30:54,744][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:30:55,251][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:30:55,755][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:30:56,265][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:30:56,769][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:30:57,274][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:30:57,782][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:30:58,288][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:30:58,791][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:30:59,299][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:30:59,803][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:31:00,312][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:31:00,826][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:31:01,331][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:31:01,839][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:31:02,349][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:31:02,872][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:31:03,378][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:31:03,885][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:31:04,396][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:31:04,903][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:31:05,411][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:31:05,916][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:31:06,425][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:31:06,937][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:31:07,443][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:31:07,953][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:31:08,461][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:31:08,980][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:31:09,493][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:31:10,000][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:31:10,510][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:31:11,023][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:31:11,531][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:31:12,039][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:31:12,550][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:31:13,057][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:31:13,574][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:31:14,082][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:31:14,588][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:31:15,094][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:31:15,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10212 tokens. [2025-11-13 04:31:16,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 04:31:17,211][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:31:17,214][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:31:17,216][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:31:18,258][__main__][INFO] - Iteration 424 took 1m 12s (49.14% Gen, 49.42% Train). Generation: 35s, Training: 35s. Estimated remaining time: 54h 0m 36s. Estimated total time: 60h 22m 13s. Time estimates for 10 more iterations: 12m 4s, 100 more iterations: 2h 0m 44s, 500 more iterations: 10h 3m 42s. [2025-11-13 04:31:18,260][__main__][INFO] - Starting iteration 424. [2025-11-13 04:31:18,733][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 04:31:18,733][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:31:40,001][__main__][INFO] - Number of regex retries in iteration 424: 0 [2025-11-13 04:31:40,001][__main__][INFO] - agents played in iteration 424 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:31:40,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:31:40,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:31:40,819][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:31:40,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:31:40,842][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:31:40,843][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:31:41,573][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:31:42,034][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:31:42,551][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:31:43,062][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:31:43,569][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:31:44,077][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:31:44,582][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:31:45,088][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:31:45,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:31:46,092][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:31:46,594][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:31:47,095][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:31:47,598][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:31:48,100][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:31:48,601][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:31:49,104][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:31:49,606][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:31:50,109][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:31:50,612][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:31:51,118][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:31:51,621][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:31:52,123][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:31:52,626][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:31:53,144][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:31:53,647][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:31:54,154][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:31:54,658][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:31:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:31:55,673][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:31:56,180][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:31:56,689][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:31:57,198][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:31:57,708][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:31:58,212][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:31:58,716][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:31:59,223][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:31:59,729][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:32:00,236][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:32:00,740][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:32:01,245][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:32:01,753][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:32:02,275][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:32:02,778][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:32:03,286][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:32:03,790][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:32:04,294][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:32:04,805][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:32:05,308][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:32:05,814][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:32:06,323][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:32:06,825][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:32:07,331][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:32:07,837][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:32:08,344][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:32:08,865][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:32:09,368][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:32:09,874][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:32:10,379][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:32:10,884][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:32:11,398][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:32:11,901][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:32:12,403][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:32:12,909][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:32:13,452][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:32:15,202][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10115 tokens. [2025-11-13 04:32:16,040][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:34 [2025-11-13 04:32:17,260][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:32:17,263][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:32:17,265][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:32:18,228][__main__][INFO] - Iteration 425 took 59s (35.75% Gen, 62.63% Train). Generation: 21s, Training: 37s. Estimated remaining time: 43h 12m 12s. Estimated total time: 49h 34m 48s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 9s, 500 more iterations: 8h 15m 48s. [2025-11-13 04:32:18,231][__main__][INFO] - Starting iteration 425. [2025-11-13 04:32:18,714][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 04:32:18,714][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:32:43,374][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:32:53,476][__main__][INFO] - Number of regex retries in iteration 425: 1 [2025-11-13 04:32:53,477][__main__][INFO] - agents played in iteration 425 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:32:54,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:32:54,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:32:54,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:32:54,403][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:32:54,404][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:32:54,404][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:32:55,201][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:32:55,753][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:32:56,271][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:32:56,777][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:32:57,286][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:32:57,796][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:32:58,305][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:32:58,816][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:32:59,322][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:32:59,826][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:33:00,342][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:33:00,848][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:33:01,366][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:33:01,870][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:33:02,379][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:33:02,886][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:33:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:33:03,895][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:33:04,403][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:33:04,909][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:33:05,417][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:33:05,925][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:33:06,430][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:33:06,947][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:33:07,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:33:07,971][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:33:08,481][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:33:08,988][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:33:09,500][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:33:10,008][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:33:10,516][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:33:11,021][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:33:11,526][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:33:12,045][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:33:12,550][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:33:13,066][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:33:13,576][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:33:14,082][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:33:14,591][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:33:15,096][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:33:15,604][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:33:16,109][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:33:16,617][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:33:17,124][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:33:17,632][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:33:18,142][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:33:18,650][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:33:19,159][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:33:19,666][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:33:20,171][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:33:20,678][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:33:21,178][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:33:21,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:33:22,191][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:33:22,695][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:33:23,197][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:33:23,704][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:33:24,208][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:33:24,713][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:33:25,221][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:33:25,722][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:33:26,224][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:33:26,727][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:33:27,230][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:33:27,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10093 tokens. [2025-11-13 04:33:28,500][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 04:33:29,257][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:33:29,259][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:33:29,260][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:33:30,264][__main__][INFO] - Iteration 426 took 1m 11s (48.58% Gen, 50.01% Train). Generation: 34s, Training: 35s. Estimated remaining time: 53h 13m 43s. Estimated total time: 59h 37m 31s. Time estimates for 10 more iterations: 11m 55s, 100 more iterations: 1h 59m 15s, 500 more iterations: 9h 56m 15s. [2025-11-13 04:33:30,266][__main__][INFO] - Starting iteration 426. [2025-11-13 04:33:30,735][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 04:33:30,736][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:33:51,297][__main__][INFO] - Number of regex retries in iteration 426: 0 [2025-11-13 04:33:51,297][__main__][INFO] - agents played in iteration 426 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:33:52,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:33:52,201][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:33:52,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:33:52,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:33:52,247][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:33:52,247][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:33:52,954][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:33:53,410][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:33:53,918][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:33:54,417][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:33:54,920][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:33:55,429][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:33:55,976][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:33:56,481][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:33:56,983][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:33:57,482][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:33:57,991][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:33:58,490][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:33:58,996][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:33:59,499][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:33:59,998][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:34:00,497][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:34:01,005][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:34:01,506][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:34:02,011][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:34:02,522][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:34:03,024][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:34:03,525][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:34:04,029][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:34:04,541][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:34:05,045][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:34:05,550][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:34:06,099][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:34:06,609][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:34:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:34:07,623][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:34:08,137][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:34:08,649][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:34:09,157][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:34:09,664][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:34:10,169][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:34:10,683][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:34:11,194][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:34:11,700][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:34:12,207][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:34:12,713][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:34:13,219][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:34:13,727][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:34:14,235][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:34:14,745][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:34:15,260][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:34:15,766][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:34:16,279][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:34:16,786][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:34:17,292][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:34:17,796][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:34:18,303][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:34:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:34:19,310][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:34:19,812][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:34:20,316][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:34:20,819][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:34:21,322][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:34:21,826][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:34:22,333][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:34:22,839][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:34:23,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:34:25,466][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:34:25,998][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:34:26,502][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:34:27,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10121 tokens. [2025-11-13 04:34:27,811][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:34 [2025-11-13 04:34:28,498][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:34:28,501][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:34:28,503][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:34:29,515][__main__][INFO] - Iteration 427 took 58s (34.98% Gen, 63.30% Train). Generation: 20s, Training: 37s. Estimated remaining time: 42h 34m 13s. Estimated total time: 48h 59m 1s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 58s, 500 more iterations: 8h 9m 50s. [2025-11-13 04:34:29,517][__main__][INFO] - Starting iteration 427. [2025-11-13 04:34:30,012][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 04:34:30,013][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:34:49,457][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:35:01,340][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls This proposal might be suboptimal given the values, but since Alice values balls more than you do, proposing to take all balls ensures you get some points from the low-value items. A more optimal strategy might be to propose splitting the books and balls, but given the values and Alice's strategy, this might be the safer approach. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:35:02,580][__main__][INFO] - Number of regex retries in iteration 427: 2 [2025-11-13 04:35:02,581][__main__][INFO] - agents played in iteration 427 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:35:03,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:35:03,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:35:03,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:35:03,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device 
VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:35:03,548][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:35:03,549][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 04:35:04,323][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:35:04,789][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:35:05,309][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:35:05,816][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:35:06,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:35:06,835][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:35:07,339][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:35:07,848][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:35:08,355][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:35:08,861][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:35:09,372][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:35:09,877][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:35:10,385][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:35:10,890][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:35:11,395][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:35:11,906][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:35:12,414][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:35:12,918][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:35:13,424][mllm.training.trainer_common][INFO] - Processing 
mini-batch 18 of 64 [2025-11-13 04:35:13,929][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:35:14,438][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 04:35:14,945][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:35:15,450][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:35:15,970][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:35:16,474][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:35:16,990][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:35:17,494][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:35:18,002][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:35:18,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:35:19,020][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:35:19,526][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:35:20,032][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:35:20,536][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:35:21,044][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:35:21,552][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:35:22,058][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:35:22,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:35:23,091][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:35:23,601][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:35:24,106][mllm.training.trainer_common][INFO] - Processing 
mini-batch 39 of 64 [2025-11-13 04:35:24,616][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:35:25,130][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 04:35:25,637][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:35:26,147][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:35:26,651][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:35:27,158][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:35:27,688][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:35:28,198][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:35:28,708][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:35:29,220][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:35:29,729][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:35:30,235][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:35:30,744][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:35:31,250][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:35:31,756][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:35:32,260][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:35:32,767][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:35:33,269][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:35:33,777][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:35:34,297][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:35:34,803][mllm.training.trainer_common][INFO] - Processing 
mini-batch 60 of 64 [2025-11-13 04:35:35,308][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:35:35,816][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 04:35:36,323][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:35:36,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10210 tokens. [2025-11-13 04:35:37,687][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 04:35:38,463][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:35:38,464][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:35:38,466][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:35:39,538][__main__][INFO] - Iteration 428 took 1m 9s (46.84% Gen, 51.61% Train). Generation: 32s, Training: 35s. Estimated remaining time: 51h 30m 21s. Estimated total time: 57h 56m 19s. Time estimates for 10 more iterations: 11m 35s, 100 more iterations: 1h 55m 52s, 500 more iterations: 9h 39m 23s. [2025-11-13 04:35:39,540][__main__][INFO] - Starting iteration 428. [2025-11-13 04:35:40,013][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 04:35:40,013][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:36:07,497][__main__][INFO] - Number of regex retries in iteration 428: 0 [2025-11-13 04:36:07,498][__main__][INFO] - agents played in iteration 428 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:36:08,286][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:36:08,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:36:08,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:36:08,353][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:36:08,353][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:36:08,354][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:36:09,055][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:36:09,514][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:36:10,029][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:36:10,540][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:36:11,060][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:36:11,569][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:36:12,075][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:36:12,590][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:36:13,102][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:36:13,603][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:36:14,106][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:36:14,608][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:36:15,118][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:36:15,620][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:36:16,125][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:36:16,634][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:36:17,142][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:36:17,650][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:36:18,157][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:36:18,668][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:36:19,185][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:36:19,694][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:36:20,210][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:36:20,720][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:36:21,231][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:36:21,746][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:36:22,258][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:36:22,765][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:36:23,273][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:36:23,780][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:36:24,291][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:36:24,799][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:36:25,306][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:36:25,825][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:36:26,335][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:36:26,842][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:36:27,350][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:36:27,857][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:36:28,372][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:36:28,875][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:36:29,383][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:36:29,891][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:36:30,397][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:36:30,914][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:36:31,419][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:36:31,929][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:36:32,433][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:36:32,941][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:36:33,447][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:36:33,975][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:36:34,490][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:36:34,997][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:36:35,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:36:36,016][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:36:36,523][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:36:37,029][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:36:37,538][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:36:38,040][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:36:38,545][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:36:39,050][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:36:39,552][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:36:40,056][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:36:40,557][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:36:41,061][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:36:41,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10261 tokens. [2025-11-13 04:36:42,377][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 04:36:43,156][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:36:43,158][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:36:43,159][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:36:44,232][__main__][INFO] - Iteration 429 took 1m 4s (42.80% Gen, 55.53% Train). Generation: 27s, Training: 35s. Estimated remaining time: 47h 3m 57s. Estimated total time: 53h 30m 59s. Time estimates for 10 more iterations: 10m 42s, 100 more iterations: 1h 47m 1s, 500 more iterations: 8h 55m 9s. [2025-11-13 04:36:44,234][__main__][INFO] - Starting iteration 429. [2025-11-13 04:36:44,702][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 04:36:44,702][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:37:11,501][__main__][INFO] - Number of regex retries in iteration 429: 0 [2025-11-13 04:37:11,503][__main__][INFO] - agents played in iteration 429 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:37:12,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:37:12,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:37:12,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:37:12,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:37:12,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:37:12,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:37:13,291][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:37:13,758][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:37:14,275][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:37:14,790][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:37:15,304][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:37:15,810][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:37:16,322][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:37:16,835][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:37:17,339][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:37:17,849][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:37:18,354][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:37:18,864][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:37:19,373][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:37:19,880][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:37:20,385][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:37:20,894][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:37:21,402][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:37:21,913][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:37:22,423][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:37:22,929][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:37:23,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:37:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:37:24,453][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:37:24,965][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:37:25,475][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:37:26,001][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:37:26,509][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:37:27,017][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:37:27,525][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:37:28,033][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:37:28,544][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:37:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:37:29,557][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:37:30,070][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:37:30,581][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:37:31,101][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:37:31,607][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:37:32,117][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:37:32,631][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:37:33,135][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:37:33,640][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:37:34,146][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:37:34,652][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:37:35,156][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:37:35,660][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:37:36,167][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:37:36,685][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:37:37,191][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:37:37,708][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:37:38,215][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:37:38,719][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:37:39,228][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:37:39,733][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:37:40,238][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:37:40,744][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:37:41,248][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:37:41,751][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:37:42,255][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:37:42,765][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:37:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:37:43,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:37:44,281][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:37:44,788][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:37:45,292][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:37:45,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10217 tokens. [2025-11-13 04:37:46,590][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 04:37:47,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:37:47,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:37:47,227][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:37:48,156][__main__][INFO] - Iteration 430 took 1m 3s (42.24% Gen, 56.30% Train). Generation: 26s, Training: 35s. Estimated remaining time: 46h 24m 37s. Estimated total time: 52h 52m 44s. Time estimates for 10 more iterations: 10m 34s, 100 more iterations: 1h 45m 45s, 500 more iterations: 8h 48m 47s. [2025-11-13 04:37:48,158][__main__][INFO] - Starting iteration 430. [2025-11-13 04:37:48,623][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 42 and human policies 1. 
[2025-11-13 04:37:48,624][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:38:05,976][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:38:05,981][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:38:16,469][__main__][INFO] - Number of regex retries in iteration 430: 2 [2025-11-13 04:38:16,470][__main__][INFO] - agents played in iteration 430 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:38:17,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:38:17,315][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:38:17,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:38:17,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:38:17,362][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:38:17,363][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:38:18,149][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:38:18,615][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:38:19,129][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:38:19,637][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:38:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:38:20,663][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:38:21,174][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:38:21,686][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:38:22,199][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:38:22,715][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:38:23,220][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:38:23,726][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:38:24,231][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:38:24,738][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:38:25,248][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:38:25,760][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:38:26,267][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:38:26,786][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:38:27,300][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:38:27,807][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:38:28,317][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:38:28,829][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:38:29,346][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:38:29,850][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:38:30,356][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:38:30,868][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:38:31,376][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:38:31,885][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:38:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:38:32,921][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:38:33,442][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:38:33,949][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:38:34,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:38:34,972][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:38:35,477][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:38:35,981][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:38:36,486][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:38:36,990][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:38:37,510][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:38:38,019][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:38:38,533][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:38:39,040][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:38:39,545][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:38:40,052][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:38:40,562][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:38:41,070][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:38:41,575][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:38:42,083][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:38:42,589][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:38:43,099][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:38:43,607][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:38:44,122][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:38:44,632][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:38:45,165][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:38:45,674][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:38:46,180][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:38:46,688][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:38:47,198][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:38:47,711][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:38:48,222][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:38:48,733][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:38:49,245][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:38:49,750][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:38:50,256][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:38:50,761][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10250 tokens. [2025-11-13 04:38:51,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:33 [2025-11-13 04:38:52,216][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:38:52,217][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:38:52,219][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:38:54,231][__main__][INFO] - Iteration 431 took 1m 5s (42.44% Gen, 54.49% Train). Generation: 27s, Training: 35s. Estimated remaining time: 48h 11m 11s. Estimated total time: 54h 40m 24s. Time estimates for 10 more iterations: 10m 56s, 100 more iterations: 1h 49m 20s, 500 more iterations: 9h 6m 44s. [2025-11-13 04:38:54,233][__main__][INFO] - Starting iteration 431. [2025-11-13 04:38:54,698][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 04:38:54,699][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:39:27,205][__main__][INFO] - Number of regex retries in iteration 431: 0 [2025-11-13 04:39:27,208][__main__][INFO] - agents played in iteration 431 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:39:28,092][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:39:28,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:39:28,139][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:39:28,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:39:28,163][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:39:28,164][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:39:29,004][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:39:29,474][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:39:29,987][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:39:30,493][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:39:31,002][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:39:31,515][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:39:32,024][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:39:32,531][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:39:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:39:33,551][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:39:34,060][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:39:34,566][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:39:35,072][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:39:35,584][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:39:36,089][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:39:36,597][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:39:37,104][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:39:37,612][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:39:38,133][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:39:38,639][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:39:39,147][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:39:39,653][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:39:40,156][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:39:40,667][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:39:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:39:41,686][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:39:42,191][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:39:42,697][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:39:43,206][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:39:43,708][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:39:44,210][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:39:44,727][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:39:45,230][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:39:45,744][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:39:46,246][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:39:46,749][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:39:47,263][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:39:47,772][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:39:48,276][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:39:48,783][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:39:49,288][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:39:49,794][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:39:50,299][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:39:50,809][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:39:51,343][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:39:51,848][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:39:52,354][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:39:52,861][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:39:53,362][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:39:53,866][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:39:54,368][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:39:54,872][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:39:55,377][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:39:55,881][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:39:56,383][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:39:56,884][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:39:57,384][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:39:57,885][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:39:58,389][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:39:58,894][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:39:59,397][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:39:59,898][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:40:00,404][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:40:00,916][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:40:01,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10220 tokens. [2025-11-13 04:40:02,186][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 04:40:02,829][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:40:02,831][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:40:02,833][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:40:03,799][__main__][INFO] - Iteration 432 took 1m 9s (47.05% Gen, 51.56% Train). Generation: 32s, Training: 35s. Estimated remaining time: 51h 4m 41s. Estimated total time: 57h 35m 3s. Time estimates for 10 more iterations: 11m 31s, 100 more iterations: 1h 55m 10s, 500 more iterations: 9h 35m 50s. [2025-11-13 04:40:03,801][__main__][INFO] - Starting iteration 432. [2025-11-13 04:40:04,291][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 04:40:04,291][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:40:27,611][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:40:33,326][__main__][INFO] - Number of regex retries in iteration 432: 1 [2025-11-13 04:40:33,327][__main__][INFO] - agents played in iteration 432 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:40:34,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:40:34,172][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:40:34,196][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:40:34,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:40:34,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:40:34,221][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:40:35,052][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:40:35,515][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:40:36,037][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:40:36,545][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:40:37,053][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:40:37,569][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:40:38,077][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:40:38,584][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:40:39,091][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:40:39,603][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:40:40,111][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:40:40,618][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:40:41,130][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:40:41,638][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:40:42,143][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:40:42,649][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:40:43,153][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:40:43,662][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:40:44,169][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:40:44,675][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:40:45,182][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:40:45,687][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:40:46,194][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:40:46,697][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:40:47,200][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:40:47,703][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:40:48,212][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:40:48,721][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:40:49,231][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:40:49,738][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:40:50,242][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:40:50,752][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:40:51,255][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:40:51,764][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:40:52,272][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:40:52,777][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:40:53,282][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:40:53,786][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:40:54,311][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:40:54,821][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:40:55,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:40:55,835][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:40:56,339][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:40:56,842][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:40:57,345][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:40:57,848][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:40:58,347][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:40:58,847][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:40:59,347][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:40:59,846][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:41:00,346][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:41:00,845][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:41:01,366][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:41:01,866][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:41:02,367][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:41:02,866][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:41:03,365][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:41:03,874][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:41:04,376][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:41:04,880][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:41:05,384][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:41:05,886][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:41:06,390][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:41:06,894][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:41:07,398][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10310 tokens. [2025-11-13 04:41:08,137][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.48%, ΔTime: 00:00:33 [2025-11-13 04:41:08,869][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:41:08,871][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:41:08,873][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:41:09,870][__main__][INFO] - Iteration 433 took 1m 5s (44.27% Gen, 54.20% Train). Generation: 29s, Training: 35s. Estimated remaining time: 48h 7m 31s. Estimated total time: 54h 38m 59s. Time estimates for 10 more iterations: 10m 55s, 100 more iterations: 1h 49m 17s, 500 more iterations: 9h 6m 29s. [2025-11-13 04:41:09,872][__main__][INFO] - Starting iteration 433. [2025-11-13 04:41:10,370][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 04:41:10,371][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:41:40,424][__main__][INFO] - Number of regex retries in iteration 433: 0 [2025-11-13 04:41:40,426][__main__][INFO] - agents played in iteration 433 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:41:41,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:41:41,357][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:41:41,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:41:41,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:41:41,405][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:41:41,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:41:42,224][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:41:42,693][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:41:43,204][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:41:43,715][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:41:44,224][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:41:44,729][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:41:45,236][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:41:45,745][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:41:46,249][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:41:46,755][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:41:47,280][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:41:47,791][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:41:48,296][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:41:48,801][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:41:49,307][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:41:49,812][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:41:50,315][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:41:50,817][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:41:51,323][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:41:51,827][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:41:52,328][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:41:52,830][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:41:53,337][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:41:53,841][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:41:54,348][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:41:54,849][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:41:55,354][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:41:55,857][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:41:56,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:41:56,875][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:41:57,381][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:41:57,886][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:41:58,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:41:58,894][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:41:59,398][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:41:59,902][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:42:00,411][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:42:00,911][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:42:01,412][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:42:01,917][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:42:02,418][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:42:02,926][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:42:03,429][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:42:03,933][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:42:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:42:04,931][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:42:05,436][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:42:05,938][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:42:06,443][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:42:06,953][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:42:07,453][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:42:07,979][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:42:08,481][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:42:08,985][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:42:09,488][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:42:09,989][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:42:10,495][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:42:10,995][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:42:11,493][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:42:11,997][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:42:12,497][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:42:12,997][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:42:13,497][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:42:13,997][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:42:14,497][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10097 tokens. [2025-11-13 04:42:15,249][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 04:42:15,883][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:42:15,885][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:42:15,887][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:42:16,775][__main__][INFO] - Iteration 434 took 1m 6s (45.26% Gen, 53.40% Train). Generation: 30s, Training: 35s. Estimated remaining time: 48h 47m 41s. Estimated total time: 55h 20m 16s. Time estimates for 10 more iterations: 11m 4s, 100 more iterations: 1h 50m 40s, 500 more iterations: 9h 13m 22s. [2025-11-13 04:42:16,777][__main__][INFO] - Starting iteration 434. [2025-11-13 04:42:17,279][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 04:42:17,280][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:42:47,638][__main__][INFO] - Number of regex retries in iteration 434: 0 [2025-11-13 04:42:47,639][__main__][INFO] - agents played in iteration 434 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:42:48,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:42:48,480][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:42:48,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:42:48,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:42:48,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:42:48,543][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:42:49,334][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:42:49,803][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:42:50,309][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:42:50,813][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:42:51,325][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:42:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:42:52,331][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:42:52,834][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:42:53,339][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:42:53,846][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:42:54,359][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:42:54,867][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:42:55,375][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:42:55,880][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:42:56,395][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:42:56,902][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:42:57,405][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:42:57,910][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:42:58,417][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:42:58,921][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:42:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:42:59,934][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:43:00,440][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:43:00,948][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:43:01,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:43:01,967][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:43:02,469][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:43:02,987][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:43:03,493][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:43:04,007][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:43:04,513][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:43:05,017][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:43:05,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:43:06,030][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:43:06,532][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:43:07,041][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:43:07,542][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:43:08,048][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:43:08,552][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:43:09,056][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:43:09,570][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:43:10,074][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:43:10,578][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:43:11,091][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:43:11,597][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:43:12,104][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:43:12,606][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:43:13,108][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:43:13,616][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:43:14,142][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:43:14,652][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:43:15,167][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:43:15,674][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:43:16,198][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:43:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:43:17,210][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:43:17,716][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:43:18,221][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:43:18,729][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:43:19,233][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:43:19,737][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:43:20,240][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:43:20,746][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:43:21,247][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:43:21,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10257 tokens. [2025-11-13 04:43:22,499][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 04:43:23,232][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:43:23,234][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:43:23,236][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:43:24,297][__main__][INFO] - Iteration 435 took 1m 7s (45.30% Gen, 53.11% Train). Generation: 30s, Training: 35s. Estimated remaining time: 49h 17m 13s. Estimated total time: 55h 50m 55s. Time estimates for 10 more iterations: 11m 10s, 100 more iterations: 1h 51m 41s, 500 more iterations: 9h 18m 29s. [2025-11-13 04:43:24,299][__main__][INFO] - Starting iteration 435. [2025-11-13 04:43:24,764][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 04:43:24,765][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:43:48,776][__main__][INFO] - Number of regex retries in iteration 435: 0 [2025-11-13 04:43:48,778][__main__][INFO] - agents played in iteration 435 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:43:49,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:43:49,705][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:43:49,734][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:43:49,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:43:49,758][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:43:49,759][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:43:50,575][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:43:51,038][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:43:51,549][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:43:52,055][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:43:52,559][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:43:53,067][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:43:53,574][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:43:54,088][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:43:54,600][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:43:55,111][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:43:55,619][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:43:56,128][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:43:56,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:43:57,143][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:43:57,654][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:43:58,158][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:43:58,667][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:43:59,171][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:43:59,679][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:44:00,192][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:44:00,695][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:44:01,214][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:44:01,717][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:44:02,223][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:44:02,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:44:03,233][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:44:03,737][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:44:04,243][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:44:04,752][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:44:05,259][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:44:05,762][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:44:06,264][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:44:06,776][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:44:07,282][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:44:07,807][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:44:08,311][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:44:08,816][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:44:09,322][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:44:09,824][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:44:10,328][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:44:10,836][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:44:11,359][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:44:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:44:12,375][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:44:12,879][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:44:13,386][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:44:13,887][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:44:14,399][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:44:14,902][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:44:15,406][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:44:15,906][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:44:16,409][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:44:16,910][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:44:17,412][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:44:17,910][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:44:18,419][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:44:18,918][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:44:19,419][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:44:19,923][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:44:20,426][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:44:20,928][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:44:21,428][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:44:21,927][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:44:22,429][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:44:22,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10232 tokens. [2025-11-13 04:44:23,691][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 04:44:24,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:44:24,325][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:44:24,327][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:44:25,259][__main__][INFO] - Iteration 436 took 1m 0s (39.69% Gen, 58.76% Train). Generation: 24s, Training: 35s. Estimated remaining time: 43h 50m 4s. Estimated total time: 50h 24m 48s. Time estimates for 10 more iterations: 10m 4s, 100 more iterations: 1h 40m 49s, 500 more iterations: 8h 24m 8s. [2025-11-13 04:44:25,261][__main__][INFO] - Starting iteration 436. [2025-11-13 04:44:25,745][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 04:44:25,745][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:44:53,313][__main__][INFO] - Number of regex retries in iteration 436: 0 [2025-11-13 04:44:53,313][__main__][INFO] - agents played in iteration 436 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:44:54,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:44:54,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:44:54,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:44:54,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:44:54,211][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:44:54,212][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:44:55,036][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:44:55,503][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:44:56,113][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:44:56,617][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:44:57,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:44:57,633][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:44:58,138][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:44:58,647][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:44:59,153][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:44:59,662][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:45:00,167][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:45:00,670][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:45:01,173][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:45:01,679][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:45:02,183][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:45:02,688][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:45:03,193][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:45:03,699][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:45:04,218][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:45:04,722][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:45:05,225][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:45:05,727][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:45:06,231][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:45:06,731][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:45:07,232][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:45:07,733][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:45:08,239][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:45:08,743][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:45:09,245][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:45:09,748][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:45:10,254][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:45:10,763][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:45:11,267][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:45:11,774][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:45:12,284][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:45:12,787][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:45:13,291][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:45:13,796][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:45:14,302][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:45:14,828][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:45:15,329][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:45:15,830][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:45:16,327][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:45:16,828][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:45:17,336][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:45:17,835][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:45:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:45:18,844][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:45:19,344][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:45:19,849][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:45:20,348][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:45:20,851][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:45:21,352][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:45:21,853][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:45:22,357][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:45:22,862][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:45:23,364][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:45:23,870][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:45:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:45:24,876][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:45:25,385][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:45:25,888][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:45:26,392][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:45:26,896][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:45:27,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10218 tokens. [2025-11-13 04:45:28,140][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 04:45:28,857][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:45:28,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:45:28,861][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:45:30,030][__main__][INFO] - Iteration 437 took 1m 4s (42.88% Gen, 55.30% Train). Generation: 27s, Training: 35s. Estimated remaining time: 46h 58m 29s. Estimated total time: 53h 34m 18s. Time estimates for 10 more iterations: 10m 42s, 100 more iterations: 1h 47m 8s, 500 more iterations: 8h 55m 43s. [2025-11-13 04:45:30,032][__main__][INFO] - Starting iteration 437. [2025-11-13 04:45:30,507][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 04:45:30,508][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:45:55,173][__main__][INFO] - Number of regex retries in iteration 437: 0 [2025-11-13 04:45:55,175][__main__][INFO] - agents played in iteration 437 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:45:56,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:45:56,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:45:56,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:45:56,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:45:56,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:45:56,268][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:45:57,116][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:45:57,583][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:45:58,095][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:45:58,607][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:45:59,114][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:45:59,634][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:46:00,143][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:46:00,650][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:46:01,164][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:46:01,667][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:46:02,178][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:46:02,683][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:46:03,188][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:46:03,694][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:46:04,200][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:46:04,724][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:46:05,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:46:05,732][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:46:06,234][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:46:06,734][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:46:07,240][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:46:07,743][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:46:08,245][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:46:08,753][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:46:09,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:46:09,761][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:46:10,262][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:46:10,768][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:46:11,271][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:46:11,772][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:46:12,272][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:46:12,777][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:46:13,280][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:46:13,788][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:46:14,293][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:46:14,799][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:46:15,318][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:46:15,825][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:46:16,331][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:46:16,839][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:46:17,339][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:46:17,843][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:46:18,343][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:46:18,850][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:46:19,354][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:46:19,853][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:46:20,352][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:46:20,855][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:46:21,355][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:46:21,856][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:46:22,355][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:46:22,858][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:46:23,359][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:46:23,862][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:46:24,365][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:46:24,864][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:46:25,364][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:46:25,865][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:46:26,365][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:46:26,865][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:46:27,367][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:46:27,866][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:46:28,366][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:46:28,865][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:46:29,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10095 tokens. [2025-11-13 04:46:30,111][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 04:46:30,731][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:46:30,732][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:46:30,734][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:46:31,646][__main__][INFO] - Iteration 438 took 1m 1s (40.35% Gen, 58.16% Train). Generation: 24s, Training: 35s. Estimated remaining time: 44h 20m 7s. Estimated total time: 50h 56m 57s. Time estimates for 10 more iterations: 10m 11s, 100 more iterations: 1h 41m 53s, 500 more iterations: 8h 29m 29s. [2025-11-13 04:46:31,648][__main__][INFO] - Starting iteration 438. [2025-11-13 04:46:32,131][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 04:46:32,132][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:47:04,692][__main__][INFO] - Number of regex retries in iteration 438: 0 [2025-11-13 04:47:04,693][__main__][INFO] - agents played in iteration 438 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:47:05,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:47:05,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:47:05,599][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:47:05,622][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:47:05,622][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:47:05,623][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:47:06,436][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:47:06,896][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:47:07,423][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:47:07,928][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:47:08,437][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:47:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:47:09,450][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:47:09,960][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:47:10,464][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:47:10,967][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:47:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:47:11,977][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:47:12,480][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:47:12,983][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:47:13,490][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:47:13,997][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:47:14,499][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:47:15,003][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:47:15,509][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:47:16,013][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:47:16,529][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:47:17,032][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:47:17,534][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:47:18,041][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:47:18,543][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:47:19,054][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:47:19,558][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:47:20,087][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:47:20,597][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:47:21,102][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:47:21,615][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:47:22,121][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:47:22,631][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:47:23,136][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:47:23,640][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:47:24,146][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:47:24,646][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:47:25,147][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:47:25,660][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:47:26,163][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:47:26,664][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:47:27,168][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:47:27,670][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:47:28,173][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:47:28,678][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:47:29,181][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:47:29,688][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:47:30,190][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:47:30,691][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:47:31,191][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:47:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:47:32,197][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:47:32,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:47:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:47:33,707][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:47:34,208][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:47:34,712][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:47:35,213][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:47:35,714][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:47:36,234][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:47:36,733][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:47:37,234][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:47:37,736][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:47:38,236][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:47:38,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10140 tokens. [2025-11-13 04:47:39,469][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 04:47:40,222][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:47:40,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:47:40,225][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:47:41,201][__main__][INFO] - Iteration 439 took 1m 9s (47.14% Gen, 51.44% Train). Generation: 32s, Training: 35s. Estimated remaining time: 50h 55m 30s. Estimated total time: 57h 33m 30s. Time estimates for 10 more iterations: 11m 30s, 100 more iterations: 1h 55m 7s, 500 more iterations: 9h 35m 35s. [2025-11-13 04:47:41,203][__main__][INFO] - Starting iteration 439. [2025-11-13 04:47:41,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 04:47:41,679][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:47:57,099][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:48:00,765][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 04:48:09,663][__main__][INFO] - Number of regex retries in iteration 439: 2 [2025-11-13 04:48:09,664][__main__][INFO] - agents played in iteration 439 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:48:10,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:48:10,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:48:10,587][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:48:10,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:48:10,610][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:48:10,611][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:48:11,452][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:48:11,929][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:48:12,440][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:48:12,950][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:48:13,454][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:48:13,960][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:48:14,474][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:48:14,981][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:48:15,486][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:48:15,995][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:48:16,500][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:48:17,009][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:48:17,513][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:48:18,023][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:48:18,536][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:48:19,042][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:48:19,552][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:48:20,065][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:48:20,574][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:48:21,078][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:48:21,582][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:48:22,086][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:48:22,592][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:48:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:48:23,598][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:48:24,105][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:48:24,615][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:48:25,130][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:48:25,634][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:48:26,154][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:48:26,663][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:48:27,168][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:48:27,673][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:48:28,179][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:48:28,686][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:48:29,188][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:48:29,691][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:48:30,199][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:48:30,706][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:48:31,218][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:48:31,723][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:48:32,229][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:48:32,738][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:48:33,242][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:48:33,743][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:48:34,247][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:48:34,751][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:48:35,264][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:48:35,770][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:48:36,276][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:48:36,774][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:48:37,278][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:48:37,781][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:48:38,283][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:48:38,785][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:48:39,287][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:48:39,789][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:48:40,299][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:48:40,806][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:48:41,307][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:48:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:48:42,326][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:48:42,830][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:48:43,339][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:48:43,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10233 tokens. [2025-11-13 04:48:44,617][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 04:48:45,233][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:48:45,235][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:48:45,237][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:48:46,140][__main__][INFO] - Iteration 440 took 1m 4s (43.42% Gen, 55.18% Train). Generation: 27s, Training: 35s. Estimated remaining time: 47h 4m 0s. Estimated total time: 53h 43m 5s. Time estimates for 10 more iterations: 10m 44s, 100 more iterations: 1h 47m 26s, 500 more iterations: 8h 57m 10s. [2025-11-13 04:48:46,142][__main__][INFO] - Starting iteration 440. [2025-11-13 04:48:46,624][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 43 and human policies 1. 
[2025-11-13 04:48:46,625][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:49:08,642][__main__][INFO] - Number of regex retries in iteration 440: 0 [2025-11-13 04:49:08,643][__main__][INFO] - agents played in iteration 440 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:49:09,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:49:09,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:49:09,586][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:49:09,610][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:49:09,611][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:49:09,611][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:49:10,444][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:49:10,912][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:49:11,423][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:49:11,933][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:49:12,459][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:49:12,969][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:49:13,474][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:49:13,982][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:49:14,493][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:49:15,005][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:49:15,514][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:49:16,021][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:49:16,536][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:49:17,044][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:49:17,558][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:49:18,066][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:49:18,571][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:49:19,073][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:49:19,575][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:49:20,082][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:49:20,587][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:49:21,094][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:49:21,600][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:49:22,106][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:49:22,614][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:49:23,131][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:49:23,639][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:49:24,156][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:49:24,661][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:49:25,167][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:49:25,676][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:49:26,184][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:49:26,691][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:49:27,196][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:49:27,706][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:49:28,214][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:49:28,719][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:49:29,222][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:49:29,732][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:49:30,235][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:49:30,755][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:49:31,258][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:49:31,765][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:49:32,274][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:49:32,781][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:49:33,286][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:49:33,790][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:49:34,292][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:49:34,798][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:49:35,305][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:49:35,809][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:49:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:49:36,816][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:49:37,318][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:49:37,824][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:49:38,329][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:49:38,856][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:49:39,363][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:49:39,873][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:49:40,381][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:49:40,886][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:49:41,396][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:49:41,904][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:49:42,409][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:49:42,914][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10099 tokens. [2025-11-13 04:49:43,672][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 04:49:44,420][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:49:44,421][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:49:44,423][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:49:46,448][__main__][INFO] - Iteration 441 took 59s (36.80% Gen, 59.81% Train). Generation: 22s, Training: 35s. Estimated remaining time: 43h 11m 8s. Estimated total time: 49h 51m 13s. Time estimates for 10 more iterations: 9m 58s, 100 more iterations: 1h 39m 42s, 500 more iterations: 8h 18m 32s. [2025-11-13 04:49:46,451][__main__][INFO] - Starting iteration 441. [2025-11-13 04:49:46,936][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 04:49:46,938][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:50:14,117][__main__][INFO] - Number of regex retries in iteration 441: 0 [2025-11-13 04:50:14,118][__main__][INFO] - agents played in iteration 441 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:50:14,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:50:15,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:50:15,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:50:15,066][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:50:15,066][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:50:15,067][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:50:15,899][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:50:16,364][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:50:16,885][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:50:17,392][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:50:17,907][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:50:18,416][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:50:18,923][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:50:19,434][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:50:19,941][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:50:20,448][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:50:20,961][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:50:21,467][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:50:21,976][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:50:22,487][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:50:22,995][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:50:23,505][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:50:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:50:24,521][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:50:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:50:25,528][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:50:26,033][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:50:26,537][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:50:27,040][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:50:27,561][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:50:28,068][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:50:28,573][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:50:29,080][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:50:29,585][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:50:30,093][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:50:30,596][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:50:31,099][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:50:31,603][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:50:32,105][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:50:32,609][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:50:33,114][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:50:33,617][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:50:34,123][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:50:34,627][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:50:35,133][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:50:35,639][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:50:36,143][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:50:36,663][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:50:37,174][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:50:37,686][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:50:38,189][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:50:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:50:39,202][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:50:39,704][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:50:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:50:40,719][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:50:41,223][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:50:41,733][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:50:42,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:50:42,748][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:50:43,272][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:50:43,781][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:50:44,291][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:50:44,792][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:50:45,296][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:50:45,799][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:50:46,302][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:50:46,808][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:50:47,313][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:50:47,815][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:50:48,319][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10184 tokens. [2025-11-13 04:50:49,104][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 04:50:49,749][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:50:49,750][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:50:49,752][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:50:50,672][__main__][INFO] - Iteration 442 took 1m 3s (42.64% Gen, 55.91% Train). Generation: 27s, Training: 35s. Estimated remaining time: 46h 25m 43s. Estimated total time: 53h 6m 52s. Time estimates for 10 more iterations: 10m 37s, 100 more iterations: 1h 46m 13s, 500 more iterations: 8h 51m 8s. [2025-11-13 04:50:50,675][__main__][INFO] - Starting iteration 442. [2025-11-13 04:50:51,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 04:50:51,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:51:25,276][__main__][INFO] - Number of regex retries in iteration 442: 0 [2025-11-13 04:51:25,277][__main__][INFO] - agents played in iteration 442 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:51:26,141][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:51:26,164][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:51:26,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:51:26,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:51:26,212][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:51:26,213][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:51:27,020][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:51:27,484][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:51:27,999][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:51:28,505][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:51:29,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:51:29,517][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:51:30,022][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:51:30,533][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:51:31,040][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:51:31,545][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:51:32,070][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:51:32,579][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:51:33,120][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:51:33,627][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:51:34,140][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:51:34,652][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:51:35,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:51:35,669][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:51:36,173][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:51:36,677][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:51:37,187][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:51:37,694][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:51:38,199][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:51:38,703][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:51:39,213][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:51:39,718][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:51:40,224][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:51:40,726][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:51:41,229][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:51:41,731][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:51:42,233][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:51:42,734][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:51:43,238][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:51:43,759][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:51:44,266][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:51:44,782][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:51:45,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:51:45,797][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:51:46,304][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:51:46,808][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:51:47,313][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:51:47,821][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:51:48,325][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:51:48,835][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:51:49,343][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:51:49,851][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:51:50,360][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:51:50,868][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:51:51,394][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:51:51,895][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:51:52,399][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:51:52,901][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:51:53,405][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:51:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:51:54,411][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:51:54,914][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:51:55,417][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:51:55,918][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:51:56,420][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:51:56,924][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:51:57,425][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:51:57,927][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:51:58,429][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:51:58,936][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:51:59,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10152 tokens. [2025-11-13 04:52:00,233][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.99%, Current % of VRAM taken: 58.24%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 04:52:00,961][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:52:00,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:52:00,966][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:52:02,010][__main__][INFO] - Iteration 443 took 1m 10s (48.13% Gen, 50.39% Train). Generation: 34s, Training: 35s. Estimated remaining time: 52h 18m 41s. Estimated total time: 59h 1m 2s. Time estimates for 10 more iterations: 11m 48s, 100 more iterations: 1h 58m 2s, 500 more iterations: 9h 50m 10s. [2025-11-13 04:52:02,012][__main__][INFO] - Starting iteration 443. [2025-11-13 04:52:02,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 04:52:02,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:52:30,515][__main__][INFO] - Number of regex retries in iteration 443: 0 [2025-11-13 04:52:30,516][__main__][INFO] - agents played in iteration 443 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:52:31,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:52:31,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:52:31,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:52:31,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:52:31,454][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:52:31,454][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:52:32,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:52:32,738][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:52:33,251][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:52:33,776][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:52:34,285][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:52:34,798][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:52:35,301][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:52:35,809][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:52:36,319][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:52:36,824][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:52:37,330][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:52:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:52:38,342][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:52:38,847][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:52:39,353][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:52:39,857][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:52:40,359][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:52:40,862][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:52:41,380][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:52:41,893][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:52:42,397][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:52:42,901][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:52:43,409][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:52:43,921][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:52:44,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:52:44,934][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:52:45,443][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:52:45,987][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:52:46,495][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:52:47,001][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:52:47,514][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:52:48,023][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:52:48,531][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:52:49,037][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:52:49,545][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:52:50,048][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:52:50,555][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:52:51,058][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:52:51,561][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:52:52,065][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:52:52,568][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:52:53,073][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:52:53,582][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:52:54,086][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:52:54,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:52:55,107][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:52:55,611][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:52:56,112][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:52:56,615][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:52:57,116][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:52:57,618][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:52:58,123][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:52:58,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:52:59,131][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:52:59,633][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:53:00,137][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:53:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:53:01,149][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:53:01,653][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:53:02,160][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:53:02,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:53:03,176][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:53:03,683][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:53:04,194][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:53:04,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10241 tokens. [2025-11-13 04:53:05,513][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 04:53:06,157][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:53:06,160][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:53:06,162][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:53:07,061][__main__][INFO] - Iteration 444 took 1m 4s (43.41% Gen, 55.20% Train). Generation: 28s, Training: 35s. Estimated remaining time: 47h 5m 20s. Estimated total time: 53h 48m 46s. Time estimates for 10 more iterations: 10m 45s, 100 more iterations: 1h 47m 37s, 500 more iterations: 8h 58m 7s. [2025-11-13 04:53:07,063][__main__][INFO] - Starting iteration 444. [2025-11-13 04:53:07,554][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 04:53:07,555][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:53:39,159][__main__][INFO] - Number of regex retries in iteration 444: 0 [2025-11-13 04:53:39,159][__main__][INFO] - agents played in iteration 444 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:53:40,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:53:40,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:53:40,055][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:53:40,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:53:40,080][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:53:40,081][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:53:40,848][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:53:41,309][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:53:41,816][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:53:42,328][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:53:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:53:43,336][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:53:43,844][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:53:44,350][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:53:44,855][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:53:45,382][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:53:45,891][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:53:46,413][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:53:46,920][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:53:47,428][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:53:47,932][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:53:48,438][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:53:48,945][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:53:49,449][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:53:49,955][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:53:50,464][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:53:50,967][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:53:51,473][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:53:51,978][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:53:52,482][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:53:53,000][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:53:53,504][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:53:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:53:54,534][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:53:55,039][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:53:55,541][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:53:56,043][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:53:56,546][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:53:57,051][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:53:57,554][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:53:58,056][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:53:58,558][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:53:59,061][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:53:59,566][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:54:00,070][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:54:00,573][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:54:01,078][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:54:01,584][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:54:02,085][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:54:02,586][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:54:03,105][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:54:03,608][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:54:04,113][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:54:04,622][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:54:05,132][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:54:05,635][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:54:06,140][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:54:06,646][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:54:07,151][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:54:07,660][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:54:08,170][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:54:08,678][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:54:09,184][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:54:09,690][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:54:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:54:10,707][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:54:11,212][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:54:11,715][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:54:12,234][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:54:12,737][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:54:13,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10205 tokens. [2025-11-13 04:54:14,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.29%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 04:54:14,875][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:54:14,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:54:14,879][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:54:15,924][__main__][INFO] - Iteration 445 took 1m 8s (46.23% Gen, 52.25% Train). Generation: 31s, Training: 35s. Estimated remaining time: 50h 13m 57s. Estimated total time: 56h 58m 31s. Time estimates for 10 more iterations: 11m 23s, 100 more iterations: 1h 53m 57s, 500 more iterations: 9h 29m 45s. [2025-11-13 04:54:15,926][__main__][INFO] - Starting iteration 445. [2025-11-13 04:54:16,415][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 04:54:16,416][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:54:40,933][__main__][INFO] - Number of regex retries in iteration 445: 0 [2025-11-13 04:54:40,935][__main__][INFO] - agents played in iteration 445 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:54:41,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:54:41,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:54:41,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:54:41,872][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:54:41,872][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:54:41,873][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:54:42,697][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:54:43,169][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:54:43,684][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:54:44,190][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:54:44,695][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:54:45,219][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:54:45,723][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:54:46,238][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:54:46,742][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:54:47,250][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:54:47,765][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:54:48,271][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:54:48,785][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:54:49,299][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:54:49,805][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:54:50,330][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:54:50,840][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:54:51,348][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:54:51,852][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:54:52,358][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:54:52,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:54:53,377][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:54:53,887][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:54:54,397][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:54:54,905][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:54:55,412][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:54:55,918][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:54:56,425][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:54:56,931][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:54:57,435][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:54:57,940][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:54:58,446][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:54:58,953][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:54:59,459][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:54:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:55:00,473][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:55:00,980][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:55:01,482][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:55:01,988][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:55:02,497][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:55:03,007][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:55:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:55:04,019][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:55:04,528][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:55:05,032][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:55:05,535][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:55:06,044][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:55:06,546][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:55:07,051][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:55:07,560][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:55:08,090][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:55:08,594][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:55:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:55:09,606][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:55:10,115][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:55:10,623][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:55:11,135][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:55:11,640][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:55:12,143][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:55:12,652][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:55:13,156][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:55:13,660][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:55:14,163][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:55:14,666][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:55:15,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10221 tokens. [2025-11-13 04:55:15,985][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 04:55:16,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:55:16,656][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:55:16,659][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:55:17,605][__main__][INFO] - Iteration 446 took 1m 1s (40.07% Gen, 58.38% Train). Generation: 24s, Training: 35s. Estimated remaining time: 44h 13m 55s. Estimated total time: 50h 59m 31s. Time estimates for 10 more iterations: 10m 11s, 100 more iterations: 1h 41m 59s, 500 more iterations: 8h 29m 55s. [2025-11-13 04:55:17,607][__main__][INFO] - Starting iteration 446. [2025-11-13 04:55:18,106][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 04:55:18,106][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:55:47,369][__main__][INFO] - Number of regex retries in iteration 446: 0 [2025-11-13 04:55:47,369][__main__][INFO] - agents played in iteration 446 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:55:48,218][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:55:48,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:55:48,266][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:55:48,289][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:55:48,289][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:55:48,290][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:55:49,097][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:55:49,567][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:55:50,081][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:55:50,589][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:55:51,103][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:55:51,613][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:55:52,123][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:55:52,634][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:55:53,142][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:55:53,647][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:55:54,154][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:55:54,667][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:55:55,173][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:55:55,676][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:55:56,180][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:55:56,686][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:55:57,191][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:55:57,697][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:55:58,201][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:55:58,712][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:55:59,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:55:59,728][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:56:00,231][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:56:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:56:01,253][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:56:01,756][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:56:02,259][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:56:02,768][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:56:03,272][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:56:03,782][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:56:04,286][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:56:04,793][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:56:05,300][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:56:05,804][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:56:06,308][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:56:06,820][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:56:07,324][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:56:07,836][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:56:08,346][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:56:08,855][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:56:09,361][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:56:09,863][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:56:10,375][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:56:10,883][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:56:11,388][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:56:11,891][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:56:12,394][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:56:12,900][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:56:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:56:13,908][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:56:14,422][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:56:14,924][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:56:15,432][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:56:15,938][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:56:16,439][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:56:16,943][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:56:17,444][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:56:17,949][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:56:18,452][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:56:18,957][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:56:19,468][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:56:19,977][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:56:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:56:20,992][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:56:21,495][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10212 tokens. [2025-11-13 04:56:22,262][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 04:56:22,985][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:56:22,987][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:56:22,988][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:56:24,028][__main__][INFO] - Iteration 447 took 1m 5s (44.39% Gen, 54.03% Train). Generation: 29s, Training: 35s. Estimated remaining time: 48h 9m 28s. Estimated total time: 54h 56m 10s. Time estimates for 10 more iterations: 10m 59s, 100 more iterations: 1h 49m 52s, 500 more iterations: 9h 9m 21s. [2025-11-13 04:56:24,031][__main__][INFO] - Starting iteration 447. [2025-11-13 04:56:24,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 04:56:24,517][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:56:47,852][__main__][INFO] - Number of regex retries in iteration 447: 0 [2025-11-13 04:56:47,853][__main__][INFO] - agents played in iteration 447 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:56:48,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:56:48,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:56:48,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:56:48,746][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:56:48,747][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:56:48,748][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:56:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:56:50,029][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:56:50,540][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:56:51,052][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:56:51,566][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:56:52,074][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:56:52,585][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:56:53,093][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:56:53,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:56:54,109][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:56:54,615][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:56:55,125][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:56:55,636][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:56:56,141][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:56:56,653][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:56:57,167][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:56:57,677][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:56:58,183][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:56:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:56:59,207][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:56:59,713][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:57:00,219][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:57:00,726][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:57:01,236][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:57:01,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:57:02,251][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:57:02,763][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:57:03,269][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:57:03,776][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:57:04,286][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:57:04,792][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:57:05,296][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:57:05,800][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:57:06,307][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:57:06,817][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:57:07,326][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:57:07,832][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:57:08,339][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:57:08,845][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:57:09,351][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:57:09,858][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:57:10,366][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:57:10,885][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:57:11,393][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:57:11,902][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:57:12,415][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:57:12,925][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:57:13,432][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:57:13,941][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:57:14,446][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:57:14,950][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:57:15,452][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:57:15,975][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:57:16,475][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:57:16,976][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:57:17,484][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:57:17,988][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:57:18,493][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:57:18,995][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:57:19,498][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:57:20,003][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:57:20,506][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:57:21,011][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:57:21,516][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:57:22,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10173 tokens. [2025-11-13 04:57:22,786][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 04:57:23,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:57:23,420][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:57:23,421][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:57:24,428][__main__][INFO] - Iteration 448 took 59s (38.95% Gen, 59.37% Train). Generation: 23s, Training: 35s. Estimated remaining time: 43h 7m 55s. Estimated total time: 49h 55m 38s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 51s, 500 more iterations: 8h 19m 16s. [2025-11-13 04:57:24,430][__main__][INFO] - Starting iteration 448. [2025-11-13 04:57:24,900][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 04:57:24,900][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:57:55,846][__main__][INFO] - Number of regex retries in iteration 448: 0 [2025-11-13 04:57:55,847][__main__][INFO] - agents played in iteration 448 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:57:56,656][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:57:56,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:57:56,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:57:56,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:57:56,742][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:57:56,743][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:57:57,576][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:57:58,037][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:57:58,547][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:57:59,058][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:57:59,579][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:58:00,087][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:58:00,594][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:58:01,105][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:58:01,608][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:58:02,111][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:58:02,613][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:58:03,116][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:58:03,620][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:58:04,124][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:58:04,629][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:58:05,135][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:58:05,643][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:58:06,149][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:58:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:58:07,164][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:58:07,672][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:58:08,175][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:58:08,681][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:58:09,186][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:58:09,691][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:58:10,206][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:58:10,714][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:58:11,221][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:58:11,732][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:58:12,240][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:58:12,745][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:58:13,249][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:58:13,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:58:14,268][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:58:14,771][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:58:15,273][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:58:15,774][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:58:16,274][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:58:16,788][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:58:17,289][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:58:17,790][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:58:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:58:18,795][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:58:19,297][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:58:19,819][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:58:20,329][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:58:20,834][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:58:21,336][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:58:21,837][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:58:22,336][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:58:22,837][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:58:23,339][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:58:23,843][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:58:24,344][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:58:24,846][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:58:25,351][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:58:25,851][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:58:26,353][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:58:26,854][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:58:27,365][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:58:27,867][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:58:28,371][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:58:28,879][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:58:29,383][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:58:29,892][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10105 tokens. [2025-11-13 04:58:30,637][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 04:58:31,373][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:58:31,374][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:58:31,376][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:58:32,396][__main__][INFO] - Iteration 449 took 1m 7s (45.85% Gen, 52.64% Train). Generation: 30s, Training: 35s. Estimated remaining time: 49h 26m 0s. Estimated total time: 56h 14m 51s. Time estimates for 10 more iterations: 11m 14s, 100 more iterations: 1h 52m 29s, 500 more iterations: 9h 22m 28s. [2025-11-13 04:58:32,398][__main__][INFO] - Starting iteration 449. [2025-11-13 04:58:32,868][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 04:58:32,869][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 04:58:58,729][__main__][INFO] - Number of regex retries in iteration 449: 0 [2025-11-13 04:58:58,731][__main__][INFO] - agents played in iteration 449 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 04:58:59,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:58:59,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:58:59,697][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:58:59,725][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 04:58:59,726][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 04:58:59,727][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 04:59:00,536][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 04:59:01,002][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 04:59:01,510][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 04:59:02,014][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 04:59:02,518][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 04:59:03,033][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 04:59:03,536][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 04:59:04,040][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 04:59:04,548][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 04:59:05,058][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 04:59:05,587][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 04:59:06,095][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 04:59:06,601][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 04:59:07,109][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 04:59:07,611][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 04:59:08,115][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 04:59:08,625][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 04:59:09,130][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 04:59:09,641][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 04:59:10,147][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 04:59:10,652][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
04:59:11,157][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 04:59:11,672][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 04:59:12,197][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 04:59:12,703][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 04:59:13,206][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 04:59:13,711][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 04:59:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 04:59:14,717][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 04:59:15,220][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 04:59:15,722][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 04:59:16,231][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 04:59:16,732][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 04:59:17,243][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 04:59:17,749][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 04:59:18,256][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 04:59:18,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 04:59:19,274][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 04:59:19,788][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 04:59:20,293][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 04:59:20,799][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 04:59:21,308][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
04:59:21,812][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 04:59:22,315][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 04:59:22,818][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 04:59:23,321][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 04:59:23,828][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 04:59:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 04:59:24,832][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 04:59:25,340][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 04:59:25,849][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 04:59:26,351][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 04:59:26,869][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 04:59:27,373][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 04:59:27,889][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 04:59:28,394][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 04:59:28,896][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 04:59:29,403][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 04:59:29,905][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 04:59:30,418][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 04:59:30,922][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 04:59:31,431][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 04:59:31,936][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
04:59:32,437][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 04:59:32,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10119 tokens. [2025-11-13 04:59:33,708][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 04:59:34,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 04:59:34,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 04:59:34,348][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 04:59:35,273][__main__][INFO] - Iteration 450 took 1m 2s (41.44% Gen, 57.07% Train). Generation: 25s, Training: 35s. Estimated remaining time: 45h 10m 23s. Estimated total time: 52h 0m 17s. Time estimates for 10 more iterations: 10m 24s, 100 more iterations: 1h 44m 0s, 500 more iterations: 8h 40m 2s. [2025-11-13 04:59:35,275][__main__][INFO] - Starting iteration 450. [2025-11-13 04:59:36,019][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 44 and human policies 1. 
[2025-11-13 04:59:36,019][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:00:06,859][__main__][INFO] - Number of regex retries in iteration 450: 0 [2025-11-13 05:00:06,860][__main__][INFO] - agents played in iteration 450 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:00:07,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:00:07,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:00:07,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:00:07,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:00:07,737][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:00:07,738][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:00:08,544][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:00:09,007][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:00:09,515][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:00:10,023][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:00:10,532][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:00:11,035][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:00:11,538][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:00:12,040][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:00:12,542][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:00:13,046][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:00:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:00:14,061][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:00:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:00:15,069][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:00:15,578][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:00:16,083][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:00:16,595][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:00:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:00:17,613][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:00:18,121][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:00:18,626][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:00:19,131][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:00:19,641][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:00:20,150][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:00:20,666][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:00:21,170][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:00:21,673][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:00:22,182][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:00:22,688][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:00:23,202][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:00:23,703][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:00:24,207][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:00:24,713][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:00:25,217][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:00:25,724][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:00:26,229][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:00:26,732][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:00:27,242][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:00:27,746][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:00:28,254][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:00:28,761][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:00:29,268][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:00:29,794][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:00:30,310][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:00:30,813][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:00:31,317][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:00:31,822][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:00:32,325][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:00:32,828][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:00:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:00:33,834][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:00:34,337][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:00:34,841][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:00:35,349][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:00:35,871][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:00:36,388][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:00:36,891][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:00:37,396][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:00:37,901][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:00:38,406][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:00:38,909][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:00:39,412][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:00:39,915][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:00:40,427][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:00:40,932][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10160 tokens. [2025-11-13 05:00:41,756][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 05:00:42,493][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:00:42,495][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:00:42,496][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:00:44,468][__main__][INFO] - Iteration 451 took 1m 8s (45.06% Gen, 52.06% Train). Generation: 30s, Training: 35s. Estimated remaining time: 50h 11m 26s. Estimated total time: 57h 2m 28s. Time estimates for 10 more iterations: 11m 24s, 100 more iterations: 1h 54m 4s, 500 more iterations: 9h 30m 24s. [2025-11-13 05:00:44,470][__main__][INFO] - Starting iteration 451. [2025-11-13 05:00:44,947][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 05:00:44,947][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:01:07,636][__main__][INFO] - Number of regex retries in iteration 451: 0 [2025-11-13 05:01:07,638][__main__][INFO] - agents played in iteration 451 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:01:08,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:01:08,534][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:01:08,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:01:08,584][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:01:08,585][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:01:08,585][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:01:09,448][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:01:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:01:10,423][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:01:10,936][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:01:11,445][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:01:11,949][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:01:12,454][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:01:12,958][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:01:13,465][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:01:13,968][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:01:14,472][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:01:14,992][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:01:15,494][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:01:16,013][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:01:16,515][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:01:17,020][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:01:17,526][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:01:18,034][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:01:18,549][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:01:19,054][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:01:19,563][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:01:20,094][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:01:20,602][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:01:21,110][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:01:21,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:01:22,117][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:01:22,627][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:01:23,132][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:01:23,637][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:01:24,146][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:01:24,651][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:01:25,159][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:01:25,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:01:26,168][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:01:26,688][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:01:27,192][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:01:27,697][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:01:28,202][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:01:28,709][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:01:29,219][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:01:29,726][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:01:30,230][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:01:30,738][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:01:31,243][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:01:31,750][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:01:32,260][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:01:32,766][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:01:33,270][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:01:33,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:01:34,287][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:01:34,795][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:01:35,299][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:01:35,806][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:01:36,312][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:01:36,817][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:01:37,329][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:01:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:01:38,337][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:01:38,847][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:01:39,352][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:01:39,854][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:01:40,359][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:01:40,869][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:01:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:01:41,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10056 tokens. [2025-11-13 05:01:42,732][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 05:01:43,379][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:01:43,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:01:43,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:01:44,328][__main__][INFO] - Iteration 452 took 59s (38.21% Gen, 60.20% Train). Generation: 22s, Training: 35s. Estimated remaining time: 42h 37m 1s. Estimated total time: 49h 29m 4s. Time estimates for 10 more iterations: 9m 53s, 100 more iterations: 1h 38m 58s, 500 more iterations: 8h 14m 50s. [2025-11-13 05:01:44,330][__main__][INFO] - Starting iteration 452. [2025-11-13 05:01:44,810][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 05:01:44,810][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:02:14,044][__main__][INFO] - Number of regex retries in iteration 452: 0 [2025-11-13 05:02:14,045][__main__][INFO] - agents played in iteration 452 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:02:14,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:02:14,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:02:14,934][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:02:14,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:02:14,957][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:02:14,958][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:02:15,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:02:16,253][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:02:16,769][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:02:17,275][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:02:17,779][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:02:18,295][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:02:18,808][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:02:19,326][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:02:19,832][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:02:20,336][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:02:20,848][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:02:21,357][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:02:21,862][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:02:22,368][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:02:22,877][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:02:23,385][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:02:23,897][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:02:24,400][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:02:24,913][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:02:25,420][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:02:25,929][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:02:26,431][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:02:26,934][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:02:27,442][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:02:27,945][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:02:28,447][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:02:28,951][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:02:29,457][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:02:29,960][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:02:30,464][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:02:30,970][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:02:31,474][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:02:31,976][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:02:32,482][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:02:33,002][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:02:33,508][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:02:34,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:02:34,526][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:02:35,030][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:02:35,539][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:02:36,044][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:02:36,549][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:02:37,052][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:02:37,565][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:02:38,073][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:02:38,580][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:02:39,088][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:02:39,593][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:02:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:02:40,621][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:02:41,134][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:02:41,638][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:02:42,148][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:02:42,654][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:02:43,160][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:02:43,667][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:02:44,174][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:02:44,687][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:02:45,191][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:02:45,698][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:02:46,201][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:02:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:02:47,225][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:02:47,729][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:02:48,231][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10224 tokens. [2025-11-13 05:02:49,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 05:02:49,810][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:02:49,811][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:02:49,814][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:02:50,798][__main__][INFO] - Iteration 453 took 1m 5s (44.30% Gen, 54.20% Train). Generation: 29s, Training: 35s. Estimated remaining time: 48h 6m 17s. Estimated total time: 54h 59m 26s. Time estimates for 10 more iterations: 10m 59s, 100 more iterations: 1h 49m 58s, 500 more iterations: 9h 9m 54s. [2025-11-13 05:02:50,800][__main__][INFO] - Starting iteration 453. [2025-11-13 05:02:51,298][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 05:02:51,299][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:03:22,879][__main__][INFO] - Number of regex retries in iteration 453: 0 [2025-11-13 05:03:22,880][__main__][INFO] - agents played in iteration 453 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:03:23,744][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:03:23,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:03:23,799][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:03:23,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:03:23,822][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:03:23,823][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:03:24,618][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:03:25,084][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:03:25,594][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:03:26,101][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:03:26,621][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:03:27,130][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:03:27,637][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:03:28,145][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:03:28,652][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:03:29,157][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:03:29,664][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:03:30,171][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:03:30,675][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:03:31,182][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:03:31,688][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:03:32,192][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:03:32,700][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:03:33,209][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:03:33,715][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:03:34,218][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:03:34,742][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:03:35,246][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:03:35,750][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:03:36,255][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:03:36,756][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:03:37,265][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:03:37,774][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:03:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:03:38,793][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:03:39,305][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:03:39,816][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:03:40,325][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:03:40,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:03:41,345][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:03:41,852][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:03:42,362][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:03:42,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:03:43,377][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:03:43,886][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:03:44,396][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:03:44,905][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:03:45,413][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:03:45,918][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:03:46,439][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:03:46,944][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:03:47,452][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:03:47,960][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:03:48,474][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:03:48,978][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:03:49,485][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:03:49,990][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:03:50,498][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:03:51,006][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:03:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:03:52,020][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:03:52,527][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:03:53,051][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:03:53,558][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:03:54,061][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:03:54,573][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:03:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:03:55,589][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:03:56,094][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:03:56,601][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:03:57,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10271 tokens. [2025-11-13 05:03:57,981][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.47%, ΔTime: 00:00:33 [2025-11-13 05:03:58,643][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:03:58,645][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:03:58,646][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:03:59,568][__main__][INFO] - Iteration 454 took 1m 8s (46.26% Gen, 52.39% Train). Generation: 31s, Training: 35s. Estimated remaining time: 49h 59m 14s. Estimated total time: 56h 53m 32s. Time estimates for 10 more iterations: 11m 22s, 100 more iterations: 1h 53m 47s, 500 more iterations: 9h 28m 55s. [2025-11-13 05:03:59,570][__main__][INFO] - Starting iteration 454. [2025-11-13 05:04:00,059][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 05:04:00,059][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:04:30,368][__main__][INFO] - Number of regex retries in iteration 454: 0 [2025-11-13 05:04:30,369][__main__][INFO] - agents played in iteration 454 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:04:31,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:04:31,219][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:04:31,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:04:31,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:04:31,267][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:04:31,268][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:04:32,045][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:04:32,511][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:04:33,025][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:04:33,530][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:04:34,038][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:04:34,543][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:04:35,054][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:04:35,559][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:04:36,067][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:04:36,576][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:04:37,082][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:04:37,594][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:04:38,102][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:04:38,607][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:04:39,128][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:04:39,635][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:04:40,149][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:04:40,655][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:04:41,162][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:04:41,672][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:04:42,177][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:04:42,681][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:04:43,186][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:04:43,691][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:04:44,196][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:04:44,705][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:04:45,213][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:04:45,755][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:04:46,264][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:04:46,796][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:04:47,305][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:04:47,817][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:04:48,328][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:04:48,832][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:04:49,339][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:04:49,850][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:04:50,355][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:04:50,869][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:04:51,379][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:04:51,890][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:04:52,401][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:04:52,914][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:04:53,438][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:04:53,945][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:04:54,453][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:04:54,967][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:04:55,475][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:04:55,985][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:04:56,500][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:04:57,007][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:04:57,538][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:04:58,046][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:04:58,556][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:04:59,057][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:04:59,564][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:05:00,072][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:05:00,577][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:05:01,084][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:05:01,596][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:05:02,107][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:05:02,616][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:05:03,122][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:05:03,630][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:05:04,143][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:05:04,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10166 tokens. [2025-11-13 05:05:05,486][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.26%, Current % of VRAM taken: 58.50%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 05:05:06,216][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:05:06,218][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:05:06,220][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:05:07,268][__main__][INFO] - Iteration 455 took 1m 7s (45.10% Gen, 53.34% Train). Generation: 30s, Training: 35s. Estimated remaining time: 49h 5m 5s. Estimated total time: 56h 0m 31s. Time estimates for 10 more iterations: 11m 12s, 100 more iterations: 1h 52m 1s, 500 more iterations: 9h 20m 5s. [2025-11-13 05:05:07,271][__main__][INFO] - Starting iteration 455. [2025-11-13 05:05:07,744][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 05:05:07,745][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:05:37,145][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:05:38,562][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:05:41,374][__main__][INFO] - Number of regex retries in iteration 455: 2 [2025-11-13 05:05:41,375][__main__][INFO] - agents played in iteration 455 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:05:42,234][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:05:42,257][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:05:42,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:05:42,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:05:42,302][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:05:42,303][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:05:43,100][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:05:43,566][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:05:44,084][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:05:44,593][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:05:45,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:05:45,610][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:05:46,118][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:05:46,623][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:05:47,130][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:05:47,636][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:05:48,145][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:05:48,668][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:05:49,175][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:05:49,684][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:05:50,190][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:05:50,698][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:05:51,206][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:05:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:05:52,218][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:05:52,725][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:05:53,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:05:53,741][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:05:54,245][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:05:54,750][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:05:55,254][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:05:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:05:56,265][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:05:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:05:57,276][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:05:57,783][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:05:58,288][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:05:58,793][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:05:59,299][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:05:59,807][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:06:00,316][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:06:00,823][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:06:01,328][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:06:01,834][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:06:02,338][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:06:02,842][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:06:03,359][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:06:03,865][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:06:04,378][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:06:04,884][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:06:05,394][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:06:05,897][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:06:06,402][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:06:06,907][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:06:07,418][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:06:07,924][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:06:08,432][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:06:08,935][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:06:09,440][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:06:09,943][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:06:10,454][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:06:10,971][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:06:11,474][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:06:11,984][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:06:12,491][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:06:12,994][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:06:13,506][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:06:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:06:14,512][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:06:15,018][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:06:15,523][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10112 tokens. [2025-11-13 05:06:16,370][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:33 [2025-11-13 05:06:17,007][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:06:17,008][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:06:17,010][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:06:17,910][__main__][INFO] - Iteration 456 took 1m 10s (47.93% Gen, 50.79% Train). Generation: 33s, Training: 35s. Estimated remaining time: 51h 31m 43s. Estimated total time: 58h 28m 19s. Time estimates for 10 more iterations: 11m 41s, 100 more iterations: 1h 56m 56s, 500 more iterations: 9h 44m 43s. [2025-11-13 05:06:17,912][__main__][INFO] - Starting iteration 456. [2025-11-13 05:06:18,395][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 05:06:18,396][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:06:36,355][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:06:36,522][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:06:44,452][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:06:48,750][__main__][INFO] - Number of regex retries in iteration 456: 3 [2025-11-13 05:06:48,751][__main__][INFO] - agents played in iteration 456 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:06:49,550][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:06:49,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:06:49,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:06:49,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:06:49,618][mllm.training.trainer_ad_align][INFO] - Sharing 
advantage alignment data. [2025-11-13 05:06:49,618][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 05:06:50,432][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:06:50,892][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:06:51,401][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:06:51,909][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:06:52,414][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:06:52,922][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:06:53,427][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:06:53,936][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:06:54,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:06:54,956][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:06:55,463][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:06:55,969][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:06:56,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:06:56,981][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:06:57,486][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:06:57,997][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:06:58,501][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:06:59,002][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:06:59,511][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:07:00,015][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-13 05:07:00,523][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 05:07:01,027][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:07:01,533][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:07:02,043][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:07:02,548][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:07:03,055][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:07:03,569][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:07:04,078][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:07:04,584][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:07:05,093][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:07:05,599][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:07:06,110][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:07:06,614][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:07:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:07:07,625][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:07:08,130][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:07:08,636][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:07:09,141][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:07:09,646][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:07:10,153][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:07:10,653][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-13 05:07:11,157][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 05:07:11,662][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:07:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:07:12,679][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:07:13,187][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:07:13,691][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:07:14,195][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:07:14,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:07:15,214][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:07:15,721][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:07:16,226][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:07:16,733][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:07:17,239][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:07:17,744][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:07:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:07:18,759][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:07:19,278][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:07:19,785][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:07:20,291][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:07:20,799][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:07:22,258][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-13 05:07:23,401][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 05:07:23,908][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:07:24,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10240 tokens. [2025-11-13 05:07:25,278][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:34 [2025-11-13 05:07:25,961][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:07:25,963][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:07:25,965][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:07:26,882][__main__][INFO] - Iteration 457 took 1m 8s (44.32% Gen, 54.34% Train). Generation: 30s, Training: 37s. Estimated remaining time: 50h 6m 40s. Estimated total time: 57h 4m 25s. Time estimates for 10 more iterations: 11m 24s, 100 more iterations: 1h 54m 8s, 500 more iterations: 9h 30m 44s. [2025-11-13 05:07:26,886][__main__][INFO] - Starting iteration 457. [2025-11-13 05:07:27,374][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 05:07:27,374][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:07:55,125][__main__][INFO] - Number of regex retries in iteration 457: 0 [2025-11-13 05:07:55,126][__main__][INFO] - agents played in iteration 457 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:07:56,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:07:56,123][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:07:56,147][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:07:56,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:07:56,170][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:07:56,172][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:07:56,956][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:07:57,422][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:07:57,930][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:07:58,434][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:07:58,944][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:07:59,449][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:07:59,953][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:08:00,455][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:08:00,961][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:08:01,474][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:08:01,983][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:08:02,499][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:08:03,006][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:08:03,513][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:08:04,021][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:08:04,532][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:08:05,040][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:08:05,544][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:08:06,051][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:08:06,560][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:08:07,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:08:07,570][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:08:08,074][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:08:08,579][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:08:09,109][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:08:09,616][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:08:10,119][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:08:10,628][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:08:11,137][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:08:11,642][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:08:12,151][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:08:12,657][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:08:13,163][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:08:13,666][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:08:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:08:14,675][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:08:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:08:15,702][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:08:16,207][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:08:16,712][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:08:17,217][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:08:17,720][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:08:18,227][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:08:18,731][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:08:19,242][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:08:19,747][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:08:20,254][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:08:20,762][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:08:21,271][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:08:21,803][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:08:22,316][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:08:22,821][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:08:23,331][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:08:23,839][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:08:24,345][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:08:24,854][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:08:25,362][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:08:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:08:26,384][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:08:26,891][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:08:27,402][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:08:27,911][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:08:28,414][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:08:28,919][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:08:29,426][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10107 tokens. [2025-11-13 05:08:30,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 05:08:31,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:08:31,029][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:08:31,030][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:08:32,072][__main__][INFO] - Iteration 458 took 1m 4s (42.89% Gen, 55.49% Train). Generation: 27s, Training: 35s. Estimated remaining time: 46h 56m 7s. Estimated total time: 53h 54m 57s. Time estimates for 10 more iterations: 10m 46s, 100 more iterations: 1h 47m 49s, 500 more iterations: 8h 59m 9s. [2025-11-13 05:08:32,074][__main__][INFO] - Starting iteration 458. [2025-11-13 05:08:32,550][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 05:08:32,551][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:08:58,978][__main__][INFO] - Number of regex retries in iteration 458: 0 [2025-11-13 05:08:58,979][__main__][INFO] - agents played in iteration 458 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:08:59,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:08:59,814][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:08:59,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:08:59,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:08:59,861][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:08:59,861][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:09:00,661][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:09:01,123][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:09:01,632][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:09:02,136][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:09:02,637][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:09:03,143][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:09:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:09:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:09:04,656][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:09:05,159][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:09:05,665][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:09:06,176][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:09:06,687][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:09:07,206][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:09:07,712][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:09:08,217][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:09:08,725][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:09:09,233][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:09:09,737][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:09:10,242][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:09:10,746][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:09:11,250][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:09:11,755][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:09:12,262][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:09:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:09:13,271][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:09:13,804][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:09:14,310][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:09:14,814][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:09:15,322][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:09:15,824][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:09:16,329][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:09:16,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:09:17,339][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:09:17,844][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:09:18,347][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:09:18,855][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:09:19,360][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:09:19,866][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:09:20,390][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:09:20,896][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:09:21,410][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:09:21,916][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:09:22,424][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:09:22,933][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:09:23,437][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:09:23,939][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:09:24,445][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:09:24,951][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:09:25,462][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:09:25,969][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:09:26,478][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:09:26,996][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:09:27,503][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:09:28,011][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:09:28,519][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:09:29,028][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:09:29,535][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:09:31,397][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:09:32,201][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:09:32,705][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:09:33,221][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:09:33,727][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:09:34,231][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:09:34,736][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10128 tokens. [2025-11-13 05:09:35,600][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:34 [2025-11-13 05:09:36,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:09:36,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:09:36,285][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:09:37,209][__main__][INFO] - Iteration 459 took 1m 4s (40.87% Gen, 57.70% Train). Generation: 26s, Training: 37s. Estimated remaining time: 46h 53m 1s. Estimated total time: 53h 52m 57s. Time estimates for 10 more iterations: 10m 46s, 100 more iterations: 1h 47m 45s, 500 more iterations: 8h 58m 49s. [2025-11-13 05:09:37,211][__main__][INFO] - Starting iteration 459. [2025-11-13 05:09:37,690][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 05:09:37,691][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:10:07,612][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:10:12,907][__main__][INFO] - Number of regex retries in iteration 459: 1 [2025-11-13 05:10:12,908][__main__][INFO] - agents played in iteration 459 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:10:13,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:10:13,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:10:13,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:10:13,884][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:10:13,885][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:10:13,886][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:10:14,693][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:10:15,170][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:10:15,683][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:10:16,201][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:10:16,707][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:10:17,214][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:10:17,717][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:10:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:10:18,731][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:10:19,244][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:10:19,750][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:10:20,254][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:10:20,761][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:10:21,268][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:10:21,786][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:10:22,293][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:10:22,801][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:10:23,315][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:10:23,820][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:10:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:10:24,836][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:10:25,342][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:10:25,849][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:10:26,355][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:10:26,858][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:10:27,361][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:10:27,864][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:10:28,372][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:10:28,880][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:10:29,386][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:10:29,893][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:10:30,397][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:10:30,920][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:10:31,429][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:10:31,934][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:10:32,439][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:10:32,947][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:10:33,458][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:10:33,962][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:10:34,469][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:10:34,981][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:10:35,487][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:10:35,992][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:10:36,495][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:10:36,998][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:10:37,515][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:10:38,021][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:10:38,522][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:10:39,030][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:10:39,533][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:10:40,051][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:10:40,560][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:10:41,068][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:10:41,576][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:10:42,082][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:10:42,590][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:10:43,097][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:10:43,603][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:10:44,109][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:10:44,611][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:10:45,132][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:10:45,633][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:10:46,135][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:10:46,639][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:10:47,144][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10138 tokens. [2025-11-13 05:10:47,946][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 05:10:48,712][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:10:48,714][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:10:48,715][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:10:49,743][__main__][INFO] - Iteration 460 took 1m 12s (48.88% Gen, 49.70% Train). Generation: 35s, Training: 35s. Estimated remaining time: 53h 1m 31s. Estimated total time: 60h 2m 39s. Time estimates for 10 more iterations: 12m 0s, 100 more iterations: 2h 0m 5s, 500 more iterations: 10h 0m 26s. [2025-11-13 05:10:49,745][__main__][INFO] - Starting iteration 460. [2025-11-13 05:10:50,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 45 and human policies 1. 
[2025-11-13 05:10:50,239][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:11:10,905][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:11:11,909][__main__][INFO] - Number of regex retries in iteration 460: 1 [2025-11-13 05:11:11,910][__main__][INFO] - agents played in iteration 460 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:11:12,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:11:12,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:11:12,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:11:12,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:11:12,803][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:11:12,804][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:11:13,613][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:11:14,077][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:11:14,587][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:11:15,094][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:11:15,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:11:16,107][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:11:16,632][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:11:17,135][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:11:17,644][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:11:18,148][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:11:18,653][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:11:19,163][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:11:19,671][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:11:20,176][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:11:20,683][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:11:21,187][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:11:21,693][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:11:22,200][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:11:22,706][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:11:23,214][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:11:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:11:24,221][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:11:24,728][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:11:25,232][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:11:25,741][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:11:26,246][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:11:26,753][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:11:27,259][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:11:27,774][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:11:28,287][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:11:28,794][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:11:29,300][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:11:29,809][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:11:30,316][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:11:30,824][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:11:31,331][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:11:31,839][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:11:32,353][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:11:32,882][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:11:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:11:33,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:11:34,420][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:11:34,952][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:11:35,457][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:11:35,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:11:36,564][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:11:39,017][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:11:39,651][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:11:40,159][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:11:40,669][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:11:41,181][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:11:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:11:42,197][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:11:42,705][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:11:43,213][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:11:43,717][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:11:44,236][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:11:44,741][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:11:45,252][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:11:45,759][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:11:46,264][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:11:46,773][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:11:47,275][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:11:47,777][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:11:48,281][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10152 tokens. [2025-11-13 05:11:49,087][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:35 [2025-11-13 05:11:49,737][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:11:49,739][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:11:49,741][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:11:51,644][__main__][INFO] - Iteration 461 took 1m 1s (35.29% Gen, 61.61% Train). Generation: 21s, Training: 37s. Estimated remaining time: 44h 8m 8s. Estimated total time: 51h 10m 18s. Time estimates for 10 more iterations: 10m 14s, 100 more iterations: 1h 42m 20s, 500 more iterations: 8h 31m 43s. [2025-11-13 05:11:51,647][__main__][INFO] - Starting iteration 461. [2025-11-13 05:11:52,138][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 05:11:52,139][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:12:28,533][__main__][INFO] - Number of regex retries in iteration 461: 0 [2025-11-13 05:12:28,534][__main__][INFO] - agents played in iteration 461 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:12:29,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:12:29,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:12:29,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:12:29,474][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:12:29,475][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:12:29,476][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:12:30,301][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:12:30,766][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:12:31,279][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:12:31,790][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:12:32,298][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:12:32,804][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:12:33,309][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:12:33,820][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:12:34,339][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:12:34,845][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:12:35,349][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:12:35,864][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:12:36,370][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:12:36,876][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:12:37,380][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:12:37,886][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:12:38,395][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:12:38,905][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:12:39,414][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:12:39,923][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:12:40,430][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:12:40,950][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:12:41,455][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:12:41,960][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:12:42,472][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:12:42,974][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:12:43,479][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:12:43,985][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:12:44,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:12:45,000][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:12:45,509][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:12:46,016][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:12:46,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:12:47,031][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:12:47,547][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:12:48,051][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:12:48,557][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:12:49,067][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:12:49,572][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:12:50,081][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:12:50,587][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:12:51,094][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:12:51,599][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:12:52,108][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:12:52,614][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:12:53,128][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:12:53,636][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:12:54,146][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:12:54,646][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:12:55,149][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:12:55,659][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:12:56,160][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:12:56,662][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:12:57,163][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:12:57,666][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:12:58,170][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:12:58,673][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:12:59,176][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:12:59,687][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:13:00,190][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:13:00,697][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:13:01,205][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:13:01,712][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:13:02,220][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:13:02,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10115 tokens. [2025-11-13 05:13:03,552][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 05:13:04,305][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:13:04,307][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:13:04,309][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:13:05,374][__main__][INFO] - Iteration 462 took 1m 13s (49.70% Gen, 48.85% Train). Generation: 36s, Training: 35s. Estimated remaining time: 53h 58m 25s. Estimated total time: 61h 1m 49s. Time estimates for 10 more iterations: 12m 12s, 100 more iterations: 2h 2m 3s, 500 more iterations: 10h 10m 18s. [2025-11-13 05:13:05,376][__main__][INFO] - Starting iteration 462. [2025-11-13 05:13:05,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 05:13:05,879][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:13:26,840][__main__][INFO] - Number of regex retries in iteration 462: 0 [2025-11-13 05:13:26,840][__main__][INFO] - agents played in iteration 462 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:13:27,639][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:13:27,662][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:13:27,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:13:27,707][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:13:27,707][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:13:27,708][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:13:28,483][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:13:28,944][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:13:29,453][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:13:29,966][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:13:30,471][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:13:30,978][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:13:31,485][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:13:31,989][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:13:32,499][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:13:33,003][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:13:33,506][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:13:34,011][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:13:34,514][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:13:35,031][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:13:35,539][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:13:36,048][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:13:36,568][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:13:37,073][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:13:37,578][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:13:38,087][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:13:38,596][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:13:39,101][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:13:39,609][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:13:40,114][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:13:40,635][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:13:41,152][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:13:41,665][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:13:42,171][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:13:42,678][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:13:43,189][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:13:43,778][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:13:46,207][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:13:46,729][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:13:47,233][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:13:47,738][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:13:48,260][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:13:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:13:49,280][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:13:49,791][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:13:50,299][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:13:50,809][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:13:51,321][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:13:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:13:52,337][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:13:52,845][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:13:53,351][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:13:53,858][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:13:54,366][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:13:54,875][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:13:55,379][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:13:55,890][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:13:56,402][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:13:56,909][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:13:57,424][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:13:57,930][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:13:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:13:58,957][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:13:59,461][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:13:59,970][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:14:00,476][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:14:00,981][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:14:01,492][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:14:01,999][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:14:02,504][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:14:03,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10137 tokens. [2025-11-13 05:14:03,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:35 [2025-11-13 05:14:04,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:14:04,498][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:14:04,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:14:05,459][__main__][INFO] - Iteration 463 took 59s (35.18% Gen, 63.21% Train). Generation: 20s, Training: 37s. Estimated remaining time: 42h 34m 39s. Estimated total time: 49h 39m 3s. Time estimates for 10 more iterations: 9m 55s, 100 more iterations: 1h 39m 18s, 500 more iterations: 8h 16m 30s. [2025-11-13 05:14:05,461][__main__][INFO] - Starting iteration 463. [2025-11-13 05:14:05,963][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 05:14:05,963][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:14:39,098][__main__][INFO] - Number of regex retries in iteration 463: 0 [2025-11-13 05:14:39,099][__main__][INFO] - agents played in iteration 463 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:14:39,893][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:14:39,917][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:14:39,940][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:14:39,964][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:14:39,964][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:14:39,965][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:14:40,793][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:14:41,259][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:14:41,773][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:14:42,289][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:14:42,794][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:14:43,300][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:14:43,810][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:14:44,316][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:14:44,823][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:14:45,332][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:14:45,836][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:14:46,341][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:14:46,847][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:14:47,368][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:14:47,873][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:14:48,382][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:14:48,890][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:14:49,397][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:14:49,905][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:14:50,410][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:14:50,914][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:14:51,421][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:14:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:14:52,436][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:14:52,945][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:14:53,454][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:14:53,973][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:14:54,479][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:14:54,984][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:14:55,493][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:14:55,997][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:14:56,502][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:14:57,009][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:14:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:14:58,024][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:14:58,529][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:14:59,033][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:14:59,542][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:15:00,048][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:15:00,573][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:15:01,080][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:15:01,584][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:15:02,089][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:15:02,600][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:15:03,109][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:15:03,615][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:15:04,119][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:15:04,624][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:15:05,130][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:15:05,652][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:15:06,156][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:15:06,662][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:15:07,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:15:07,676][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:15:08,181][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:15:08,687][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:15:09,193][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:15:09,699][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:15:10,202][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:15:10,706][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:15:11,213][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:15:11,721][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:15:12,226][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:15:12,730][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:15:13,233][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10079 tokens. [2025-11-13 05:15:14,080][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.10%, ΔTime: 00:00:33 [2025-11-13 05:15:14,830][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:15:14,831][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:15:14,833][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:15:15,831][__main__][INFO] - Iteration 464 took 1m 9s (47.43% Gen, 51.14% Train). Generation: 33s, Training: 35s. Estimated remaining time: 51h 7m 51s. Estimated total time: 58h 13m 26s. Time estimates for 10 more iterations: 11m 38s, 100 more iterations: 1h 56m 26s, 500 more iterations: 9h 42m 14s. [2025-11-13 05:15:15,833][__main__][INFO] - Starting iteration 464. [2025-11-13 05:15:16,328][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 05:15:16,328][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:15:39,076][__main__][INFO] - Number of regex retries in iteration 464: 0 [2025-11-13 05:15:39,076][__main__][INFO] - agents played in iteration 464 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:15:39,994][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:15:40,017][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:15:40,040][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:15:40,063][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:15:40,064][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:15:40,065][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:15:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:15:41,334][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:15:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:15:42,351][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:15:42,861][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:15:43,368][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:15:43,874][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:15:44,387][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:15:44,898][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:15:45,409][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:15:45,961][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:15:47,875][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:15:48,646][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:15:49,162][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:15:49,671][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:15:50,177][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:15:50,684][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:15:51,188][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:15:51,693][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:15:52,199][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:15:52,703][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:15:53,212][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:15:53,717][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:15:54,221][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:15:54,728][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:15:55,238][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:15:55,762][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:15:56,270][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:15:56,779][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:15:57,286][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:15:57,794][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:15:58,301][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:15:58,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:15:59,315][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:15:59,827][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:16:00,332][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:16:00,838][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:16:01,348][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:16:01,853][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:16:02,372][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:16:02,883][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:16:03,393][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:16:03,902][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:16:04,413][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:16:04,920][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:16:05,427][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:16:05,932][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:16:06,445][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:16:06,951][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:16:07,465][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:16:07,971][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:16:08,476][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:16:08,986][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:16:09,492][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:16:10,000][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:16:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:16:11,010][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:16:11,519][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:16:12,029][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:16:12,534][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:16:13,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:16:13,568][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:16:14,077][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:16:14,587][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:16:15,095][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10197 tokens. [2025-11-13 05:16:15,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:35 [2025-11-13 05:16:16,606][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:16:16,608][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:16:16,610][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:16:17,523][__main__][INFO] - Iteration 465 took 1m 1s (37.17% Gen, 61.33% Train). Generation: 22s, Training: 37s. Estimated remaining time: 43h 53m 10s. Estimated total time: 50h 59m 46s. Time estimates for 10 more iterations: 10m 11s, 100 more iterations: 1h 41m 59s, 500 more iterations: 8h 29m 57s. [2025-11-13 05:16:17,525][__main__][INFO] - Starting iteration 465. [2025-11-13 05:16:18,016][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 05:16:18,016][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:16:45,496][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:16:46,421][__main__][INFO] - Number of regex retries in iteration 465: 1 [2025-11-13 05:16:46,421][__main__][INFO] - agents played in iteration 465 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:16:47,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:16:47,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:16:47,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:16:47,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:16:47,356][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:16:47,357][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:16:48,154][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:16:48,625][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:16:49,134][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:16:49,643][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:16:50,148][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:16:50,664][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:16:51,185][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:16:51,691][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:16:52,202][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:16:52,707][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:16:53,213][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:16:53,719][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:16:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:16:54,732][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:16:55,251][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:16:55,757][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:16:56,280][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:16:56,788][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:16:57,293][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:16:57,801][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:16:58,311][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:16:58,827][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:16:59,331][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:16:59,837][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:17:00,355][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:17:00,867][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:17:01,388][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:17:01,898][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:17:02,405][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:17:02,914][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:17:03,423][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:17:03,932][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:17:04,436][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:17:04,947][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:17:05,456][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:17:05,961][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:17:06,470][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:17:06,976][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:17:07,482][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:17:07,994][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:17:08,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:17:09,005][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:17:09,512][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:17:10,018][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:17:10,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:17:11,029][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:17:11,535][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:17:12,054][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:17:12,557][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:17:13,072][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:17:13,577][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:17:14,087][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:17:14,595][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:17:15,105][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:17:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:17:16,124][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:17:16,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:17:17,143][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:17:17,652][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:17:18,157][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:17:18,663][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:17:19,171][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:17:19,689][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:17:20,193][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:17:20,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10112 tokens. [2025-11-13 05:17:21,492][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 05:17:22,220][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:17:22,224][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:17:22,225][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:17:23,244][__main__][INFO] - Iteration 466 took 1m 5s (43.55% Gen, 54.89% Train). Generation: 28s, Training: 35s. Estimated remaining time: 47h 13m 45s. Estimated total time: 54h 21m 26s. Time estimates for 10 more iterations: 10m 52s, 100 more iterations: 1h 48m 42s, 500 more iterations: 9h 3m 34s. [2025-11-13 05:17:23,246][__main__][INFO] - Starting iteration 466. [2025-11-13 05:17:23,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 05:17:23,720][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:17:38,917][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:17:43,952][__main__][INFO] - Number of regex retries in iteration 466: 1 [2025-11-13 05:17:43,952][__main__][INFO] - agents played in iteration 466 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:17:44,837][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:17:44,861][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:17:44,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:17:44,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:17:44,908][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:17:44,909][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:17:45,620][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:17:46,085][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:17:46,591][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:17:47,093][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:17:47,596][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:17:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:17:48,599][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:17:49,104][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:17:49,608][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:17:50,118][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:17:50,622][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:17:51,127][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:17:51,644][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:17:52,148][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:17:52,656][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:17:53,162][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:17:53,670][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:17:54,197][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:17:54,704][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:17:55,212][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:17:55,849][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:17:57,782][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:17:58,492][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:17:59,006][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:17:59,516][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:18:00,020][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:18:00,525][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:18:01,031][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:18:01,536][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:18:02,051][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:18:02,558][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:18:03,068][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:18:03,574][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:18:04,078][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:18:04,591][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:18:05,098][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:18:05,607][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:18:06,116][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:18:06,621][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:18:07,127][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:18:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:18:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:18:08,646][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:18:09,151][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:18:09,654][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:18:10,159][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:18:10,667][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:18:11,188][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:18:11,693][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:18:12,199][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:18:12,704][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:18:13,209][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:18:13,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:18:14,228][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:18:14,731][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:18:15,239][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:18:15,744][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:18:16,250][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:18:16,762][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:18:17,274][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:18:17,786][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:18:18,291][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:18:18,799][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:18:19,305][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:18:19,816][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10036 tokens. [2025-11-13 05:18:20,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:35 [2025-11-13 05:18:21,325][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:18:21,327][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:18:21,330][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:18:22,260][__main__][INFO] - Iteration 467 took 58s (34.56% Gen, 63.85% Train). Generation: 20s, Training: 37s. Estimated remaining time: 41h 38m 22s. Estimated total time: 48h 47m 2s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 34s, 500 more iterations: 8h 7m 50s. [2025-11-13 05:18:22,262][__main__][INFO] - Starting iteration 467. [2025-11-13 05:18:22,748][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 05:18:22,748][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:18:52,707][__main__][INFO] - Number of regex retries in iteration 467: 0 [2025-11-13 05:18:52,708][__main__][INFO] - agents played in iteration 467 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:18:53,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:18:53,510][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:18:53,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:18:53,557][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:18:53,557][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:18:53,559][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:18:54,270][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:18:54,732][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:18:55,240][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:18:55,743][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:18:56,258][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:18:56,764][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:18:57,269][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:18:57,778][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:18:58,284][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:18:58,795][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:18:59,297][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:18:59,808][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:19:00,316][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:19:00,823][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:19:01,332][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:19:01,836][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:19:02,340][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:19:02,852][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:19:03,360][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:19:03,865][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:19:04,382][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:19:04,891][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:19:05,405][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:19:05,910][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:19:06,416][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:19:06,923][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:19:07,427][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:19:07,932][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:19:08,436][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:19:08,937][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:19:09,442][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:19:09,946][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:19:10,451][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:19:10,958][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:19:11,466][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:19:11,983][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:19:12,490][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:19:12,995][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:19:13,499][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:19:14,005][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:19:14,514][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:19:15,017][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:19:15,522][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:19:16,031][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:19:16,536][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:19:17,041][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:19:17,548][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:19:18,054][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:19:18,564][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:19:19,074][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:19:19,578][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:19:20,085][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:19:20,590][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:19:21,112][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:19:21,616][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:19:22,124][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:19:22,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:19:23,142][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:19:23,650][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:19:24,158][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:19:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:19:25,176][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:19:25,684][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:19:26,195][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:19:26,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10123 tokens. [2025-11-13 05:19:27,566][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 05:19:28,287][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:19:28,289][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:19:28,291][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:19:29,296][__main__][INFO] - Iteration 468 took 1m 6s (45.02% Gen, 53.47% Train). Generation: 29s, Training: 35s. Estimated remaining time: 48h 17m 39s. Estimated total time: 55h 27m 27s. Time estimates for 10 more iterations: 11m 5s, 100 more iterations: 1h 50m 54s, 500 more iterations: 9h 14m 34s. [2025-11-13 05:19:29,299][__main__][INFO] - Starting iteration 468. [2025-11-13 05:19:29,780][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 05:19:29,780][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:19:51,263][__main__][INFO] - Number of regex retries in iteration 468: 0 [2025-11-13 05:19:51,264][__main__][INFO] - agents played in iteration 468 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:19:52,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:19:52,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:19:52,079][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:19:52,102][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:19:52,103][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:19:52,104][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:19:52,834][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:19:53,291][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:19:53,813][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:19:54,318][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:19:54,825][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:19:55,325][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:19:55,827][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:19:56,330][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:19:56,830][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:19:57,329][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:19:57,833][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:19:58,333][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:19:58,832][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:19:59,334][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:19:59,836][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:20:00,342][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:20:00,843][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:20:01,347][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:20:01,852][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:20:02,359][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:20:02,864][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:20:04,936][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:20:05,442][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:20:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:20:06,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:20:06,958][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:20:07,470][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:20:07,978][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:20:08,483][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:20:08,993][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:20:09,521][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:20:10,026][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:20:10,530][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:20:11,040][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:20:11,545][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:20:12,051][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:20:12,556][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:20:13,064][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:20:13,579][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:20:14,089][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:20:14,604][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:20:15,112][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:20:15,621][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:20:16,139][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:20:16,645][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:20:17,159][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:20:17,667][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:20:18,174][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:20:18,690][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:20:19,200][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:20:19,708][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:20:20,219][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:20:20,731][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:20:21,240][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:20:21,746][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:20:22,253][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:20:22,761][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:20:23,269][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:20:23,775][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:20:24,285][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:20:24,793][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:20:25,317][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:20:25,824][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:20:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:20:26,841][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10056 tokens. [2025-11-13 05:20:27,724][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:34 [2025-11-13 05:20:28,376][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:20:28,378][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:20:28,380][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:20:29,332][__main__][INFO] - Iteration 469 took 59s (36.07% Gen, 62.33% Train). Generation: 21s, Training: 37s. Estimated remaining time: 42h 26m 50s. Estimated total time: 49h 37m 38s. Time estimates for 10 more iterations: 9m 55s, 100 more iterations: 1h 39m 15s, 500 more iterations: 8h 16m 16s. [2025-11-13 05:20:29,335][__main__][INFO] - Starting iteration 469. [2025-11-13 05:20:29,812][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 05:20:29,813][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:20:56,476][__main__][INFO] - Number of regex retries in iteration 469: 0 [2025-11-13 05:20:56,477][__main__][INFO] - agents played in iteration 469 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:20:57,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:20:57,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:20:57,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:20:57,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:20:57,329][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:20:57,331][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:20:58,069][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:20:58,532][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:20:59,038][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:20:59,542][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:21:00,041][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:21:00,541][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:21:01,041][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:21:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:21:02,046][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:21:02,551][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:21:03,054][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:21:03,561][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:21:04,062][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:21:04,566][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:21:05,069][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:21:05,574][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:21:06,086][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:21:06,588][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:21:07,089][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:21:07,590][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:21:08,091][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:21:08,604][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:21:09,106][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:21:09,613][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:21:10,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:21:10,623][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:21:11,132][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:21:11,636][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:21:12,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:21:12,645][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:21:13,152][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:21:13,659][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:21:14,164][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:21:14,670][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:21:15,178][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:21:15,684][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:21:16,191][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:21:16,711][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:21:17,215][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:21:17,728][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:21:18,232][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:21:18,736][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:21:19,241][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:21:19,747][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:21:20,253][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:21:20,761][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:21:21,266][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:21:21,775][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:21:22,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:21:22,788][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:21:23,298][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:21:23,826][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:21:24,346][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:21:24,854][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:21:25,359][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:21:25,866][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:21:26,376][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:21:26,884][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:21:27,395][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:21:27,901][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:21:28,414][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:21:28,920][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:21:29,433][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:21:29,939][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:21:30,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10104 tokens. [2025-11-13 05:21:31,326][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 05:21:32,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:21:32,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:21:32,086][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:21:33,059][__main__][INFO] - Iteration 470 took 1m 3s (42.16% Gen, 56.30% Train). Generation: 26s, Training: 35s. Estimated remaining time: 45h 30m 32s. Estimated total time: 52h 42m 24s. Time estimates for 10 more iterations: 10m 32s, 100 more iterations: 1h 45m 24s, 500 more iterations: 8h 47m 4s. [2025-11-13 05:21:33,062][__main__][INFO] - Starting iteration 470. [2025-11-13 05:21:33,546][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 46 and human policies 1. 
[2025-11-13 05:21:33,546][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:21:45,830][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:21:55,864][__main__][INFO] - Number of regex retries in iteration 470: 1 [2025-11-13 05:21:55,865][__main__][INFO] - agents played in iteration 470 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:21:56,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:21:56,733][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:21:56,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:21:56,782][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:21:56,782][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:21:56,783][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:21:57,514][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:21:57,973][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:21:58,483][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:21:58,987][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:21:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:22:00,002][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:22:00,502][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:22:01,007][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:22:01,507][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:22:02,008][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:22:02,531][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:22:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:22:03,535][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:22:04,043][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:22:04,545][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:22:05,051][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:22:05,552][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:22:06,053][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:22:06,554][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:22:07,053][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:22:07,553][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:22:08,053][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:22:08,553][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:22:09,055][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:22:09,555][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:22:10,055][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:22:10,564][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:22:11,067][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:22:11,568][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:22:12,070][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:22:12,572][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:22:13,079][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:22:13,581][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:22:14,085][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:22:14,589][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:22:15,096][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:22:15,600][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:22:16,104][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:22:16,608][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:22:17,122][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:22:17,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:22:18,143][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:22:18,651][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:22:19,157][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:22:19,669][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:22:20,180][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:22:20,692][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:22:21,204][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:22:21,717][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:22:22,235][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:22:22,740][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:22:23,247][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:22:23,756][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:22:24,264][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:22:24,773][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:22:25,280][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:22:25,787][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:22:26,319][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:22:26,823][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:22:27,341][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:22:27,849][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:22:28,360][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:22:28,871][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:22:29,378][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:22:29,890][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10058 tokens. [2025-11-13 05:22:30,770][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:33 [2025-11-13 05:22:31,434][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:22:31,435][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:22:31,437][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:22:33,281][__main__][INFO] - Iteration 471 took 59s (37.36% Gen, 59.55% Train). Generation: 22s, Training: 35s. Estimated remaining time: 42h 33m 57s. Estimated total time: 49h 46m 48s. Time estimates for 10 more iterations: 9m 57s, 100 more iterations: 1h 39m 33s, 500 more iterations: 8h 17m 48s. [2025-11-13 05:22:33,283][__main__][INFO] - Starting iteration 471. [2025-11-13 05:22:33,770][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 05:22:33,771][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:22:57,028][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 1, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:22:59,285][__main__][INFO] - Number of regex retries in iteration 471: 1 [2025-11-13 05:22:59,285][__main__][INFO] - agents played in iteration 471 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:23:00,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:23:00,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:23:00,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:23:00,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:23:00,215][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:23:00,216][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:23:00,932][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:23:01,392][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:23:01,907][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:23:02,411][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:23:02,919][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:23:03,424][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:23:03,927][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:23:04,434][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:23:04,944][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:23:05,449][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:23:05,953][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:23:06,456][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:23:06,962][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:23:07,472][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:23:07,975][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:23:08,482][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:23:08,983][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:23:09,487][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:23:09,995][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:23:10,499][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:23:11,015][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:23:11,523][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:23:12,028][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:23:12,532][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:23:13,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:23:13,550][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:23:14,051][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:23:14,557][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:23:15,057][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:23:15,558][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:23:16,062][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:23:16,563][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:23:17,064][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:23:17,571][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:23:18,076][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:23:18,579][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:23:19,087][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:23:19,590][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:23:20,094][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:23:20,598][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:23:21,101][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:23:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:23:22,109][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:23:22,627][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:23:23,130][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:23:23,631][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:23:24,135][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:23:24,643][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:23:25,154][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:23:25,663][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:23:26,167][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:23:26,675][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:23:27,189][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:23:27,699][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:23:28,204][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:23:28,709][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:23:29,229][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:23:29,737][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:23:30,240][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:23:30,745][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:23:31,255][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:23:31,766][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:23:32,272][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:23:32,778][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:23:33,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10158 tokens. [2025-11-13 05:23:34,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:33 [2025-11-13 05:23:34,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:23:34,960][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:23:34,961][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:23:35,927][__main__][INFO] - Iteration 472 took 1m 2s (41.04% Gen, 57.40% Train). Generation: 25s, Training: 35s. Estimated remaining time: 44h 33m 59s. Estimated total time: 51h 47m 53s. Time estimates for 10 more iterations: 10m 21s, 100 more iterations: 1h 43m 35s, 500 more iterations: 8h 37m 58s. [2025-11-13 05:23:35,929][__main__][INFO] - Starting iteration 472. [2025-11-13 05:23:36,415][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 05:23:36,415][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:23:53,172][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:24:00,519][__main__][INFO] - Number of regex retries in iteration 472: 1 [2025-11-13 05:24:00,520][__main__][INFO] - agents played in iteration 472 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:24:01,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:24:01,382][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:24:01,407][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:24:01,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:24:01,430][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:24:01,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:24:02,188][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:24:02,646][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:24:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:24:03,661][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:24:04,169][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:24:04,674][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:24:05,174][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:24:05,675][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:24:06,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:24:06,682][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:24:07,183][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:24:07,689][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:24:08,190][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:24:08,692][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:24:09,193][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:24:09,697][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:24:10,205][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:24:10,708][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:24:11,216][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:24:11,718][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:24:12,219][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:24:12,737][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:24:13,239][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:24:13,741][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:24:14,243][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:24:14,746][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:24:15,253][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:24:15,753][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:24:16,258][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:24:16,763][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:24:17,271][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:24:17,777][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:24:18,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:24:18,778][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:24:19,286][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:24:19,789][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:24:20,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:24:20,796][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:24:21,296][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:24:21,801][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:24:22,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:24:22,809][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:24:23,325][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:24:23,828][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:24:24,332][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:24:24,835][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:24:25,339][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:24:25,856][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:24:26,357][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:24:26,859][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:24:27,362][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:24:27,871][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:24:28,379][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:24:28,888][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:24:29,393][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:24:29,898][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:24:30,403][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:24:30,909][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:24:31,414][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:24:31,920][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:24:32,433][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:24:32,937][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:24:33,451][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:24:33,955][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:24:34,459][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10063 tokens. [2025-11-13 05:24:35,348][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.01%, Current % of VRAM taken: 58.26%, Block Peak % of device VRAM: 62.15%, ΔTime: 00:00:33 [2025-11-13 05:24:36,008][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:24:36,010][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:24:36,013][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:24:36,922][__main__][INFO] - Iteration 473 took 1m 0s (39.84% Gen, 58.66% Train). Generation: 24s, Training: 35s. Estimated remaining time: 43h 10m 28s. Estimated total time: 50h 25m 23s. Time estimates for 10 more iterations: 10m 5s, 100 more iterations: 1h 40m 50s, 500 more iterations: 8h 24m 13s. [2025-11-13 05:24:36,924][__main__][INFO] - Starting iteration 473. [2025-11-13 05:24:37,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 05:24:37,407][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:25:02,553][__main__][INFO] - Number of regex retries in iteration 473: 0 [2025-11-13 05:25:02,554][__main__][INFO] - agents played in iteration 473 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:25:03,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:25:03,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:25:03,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:25:03,483][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:25:03,484][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:25:03,485][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:25:04,274][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:25:04,734][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:25:05,246][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:25:05,749][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:25:06,257][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:25:06,756][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:25:07,256][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:25:07,764][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:25:08,264][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:25:08,778][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:25:09,287][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:25:09,795][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:25:10,307][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:25:10,812][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:25:11,322][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:25:11,839][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:25:12,345][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:25:12,859][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:25:13,362][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:25:13,870][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:25:14,377][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:25:14,880][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:25:15,384][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:25:15,887][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:25:16,392][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:25:16,901][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:25:17,402][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:25:17,905][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:25:18,411][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:25:18,915][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:25:19,416][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:25:19,928][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:25:20,430][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:25:20,938][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:25:21,439][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:25:21,941][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:25:22,447][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:25:22,948][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:25:23,471][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:25:23,971][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:25:24,473][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:25:24,974][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:25:25,480][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:25:25,982][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:25:26,486][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:25:26,989][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:25:27,493][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:25:27,995][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:25:28,501][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:25:29,005][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:25:29,510][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:25:30,018][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:25:30,523][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:25:31,028][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:25:31,537][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:25:32,043][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:25:32,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:25:33,058][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:25:33,566][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:25:34,081][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:25:34,587][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:25:35,098][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:25:35,608][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:25:36,117][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:25:36,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10126 tokens. [2025-11-13 05:25:37,474][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 05:25:38,229][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:25:38,230][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:25:38,232][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:25:39,294][__main__][INFO] - Iteration 474 took 1m 1s (40.63% Gen, 57.65% Train). Generation: 25s, Training: 35s. Estimated remaining time: 44h 18m 29s. Estimated total time: 51h 34m 27s. Time estimates for 10 more iterations: 10m 18s, 100 more iterations: 1h 43m 8s, 500 more iterations: 8h 35m 44s. [2025-11-13 05:25:39,296][__main__][INFO] - Starting iteration 474. [2025-11-13 05:25:39,770][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 05:25:39,771][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:26:10,413][__main__][INFO] - Number of regex retries in iteration 474: 0 [2025-11-13 05:26:10,415][__main__][INFO] - agents played in iteration 474 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:26:11,291][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:26:11,314][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:26:11,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:26:11,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:26:11,362][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:26:11,363][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:26:12,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:26:12,618][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:26:13,128][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:26:13,632][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:26:14,134][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:26:14,635][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:26:15,138][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:26:15,639][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:26:16,148][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:26:16,652][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:26:17,152][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:26:17,657][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:26:18,157][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:26:18,676][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:26:19,176][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:26:19,675][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:26:20,176][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:26:20,680][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:26:21,182][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:26:21,690][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:26:22,195][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:26:22,705][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:26:23,208][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:26:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:26:24,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:26:24,718][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:26:25,222][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:26:25,724][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:26:26,225][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:26:26,730][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:26:27,234][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:26:27,735][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:26:28,237][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:26:28,739][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:26:29,239][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:26:29,746][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:26:30,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:26:30,751][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:26:31,250][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:26:31,758][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:26:32,261][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:26:32,764][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:26:33,269][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:26:33,771][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:26:34,273][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:26:34,775][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:26:35,278][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:26:35,798][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:26:36,305][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:26:36,810][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:26:37,317][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:26:37,827][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:26:38,333][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:26:38,837][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:26:39,341][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:26:39,854][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:26:40,364][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:26:40,872][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:26:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:26:41,879][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:26:42,397][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:26:42,902][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:26:43,408][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:26:43,914][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:26:44,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10048 tokens. [2025-11-13 05:26:45,322][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33 [2025-11-13 05:26:45,983][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:26:45,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:26:45,987][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:26:46,871][__main__][INFO] - Iteration 475 took 1m 7s (45.67% Gen, 53.01% Train). Generation: 30s, Training: 35s. Estimated remaining time: 48h 37m 58s. Estimated total time: 55h 55m 4s. Time estimates for 10 more iterations: 11m 11s, 100 more iterations: 1h 51m 50s, 500 more iterations: 9h 19m 10s. [2025-11-13 05:26:46,874][__main__][INFO] - Starting iteration 475. [2025-11-13 05:26:47,365][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 05:26:47,366][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:27:09,682][__main__][INFO] - Number of regex retries in iteration 475: 0 [2025-11-13 05:27:09,683][__main__][INFO] - agents played in iteration 475 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:27:10,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:27:10,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:27:10,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:27:10,674][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:27:10,674][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:27:10,674][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:27:11,477][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:27:11,937][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:27:12,445][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:27:12,950][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:27:13,456][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:27:13,959][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:27:14,460][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:27:14,963][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:27:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:27:15,967][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:27:16,482][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:27:16,984][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:27:17,487][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:27:17,991][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:27:18,496][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:27:19,008][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:27:19,508][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:27:20,013][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:27:20,516][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:27:21,017][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:27:21,524][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:27:22,028][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:27:22,531][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:27:23,034][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:27:23,535][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:27:24,041][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:27:24,541][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:27:25,043][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:27:25,550][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:27:26,052][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:27:26,554][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:27:27,059][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:27:27,563][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:27:28,074][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:27:28,575][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:27:29,077][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:27:29,596][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:27:30,096][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:27:30,596][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:27:31,096][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:27:31,599][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:27:32,108][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:27:32,607][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:27:33,108][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:27:33,606][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:27:34,107][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:27:34,609][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:27:35,107][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:27:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:27:36,113][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:27:36,615][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:27:37,117][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:27:37,620][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:27:38,122][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:27:38,627][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:27:39,131][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:27:39,638][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:27:40,148][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:27:40,655][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:27:41,168][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:27:41,676][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:27:42,183][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:27:42,691][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:27:43,197][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:27:43,706][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10148 tokens. [2025-11-13 05:27:44,597][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 05:27:45,353][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:27:45,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:27:45,358][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:27:46,379][__main__][INFO] - Iteration 476 took 59s (37.82% Gen, 60.45% Train). Generation: 22s, Training: 35s. Estimated remaining time: 41h 52m 38s. Estimated total time: 49h 10m 42s. Time estimates for 10 more iterations: 9m 50s, 100 more iterations: 1h 38m 21s, 500 more iterations: 8h 11m 47s. [2025-11-13 05:27:46,381][__main__][INFO] - Starting iteration 476. [2025-11-13 05:27:46,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 05:27:46,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:28:12,237][__main__][INFO] - Number of regex retries in iteration 476: 0 [2025-11-13 05:28:12,239][__main__][INFO] - agents played in iteration 476 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:28:13,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:28:13,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:28:13,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:28:13,153][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:28:13,154][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:28:13,155][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:28:13,896][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:28:14,367][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:28:14,885][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:28:15,402][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:28:15,907][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:28:16,411][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:28:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:28:17,423][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:28:17,930][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:28:18,434][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:28:18,935][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:28:19,443][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:28:19,943][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:28:20,451][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:28:20,953][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:28:21,458][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:28:21,957][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:28:22,458][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:28:22,958][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:28:23,459][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:28:23,961][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:28:24,461][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:28:24,963][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:28:25,464][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:28:25,979][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:28:26,480][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:28:26,985][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:28:27,490][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:28:27,993][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:28:28,505][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:28:29,007][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:28:29,514][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:28:30,014][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:28:30,517][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:28:31,022][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:28:31,530][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:28:32,034][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:28:32,536][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:28:33,045][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:28:33,549][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:28:34,051][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:28:34,552][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:28:35,054][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:28:35,558][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:28:36,059][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:28:36,563][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:28:37,063][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:28:37,564][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:28:38,065][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:28:38,568][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:28:39,070][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:28:39,568][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:28:40,068][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:28:40,570][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:28:41,069][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:28:41,569][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:28:42,080][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:28:42,581][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:28:43,095][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:28:43,600][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:28:44,106][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:28:44,611][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:28:45,115][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:28:45,618][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:28:46,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10105 tokens. [2025-11-13 05:28:46,985][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 05:28:47,646][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:28:47,648][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:28:47,650][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:28:48,547][__main__][INFO] - Iteration 477 took 1m 1s (41.06% Gen, 57.48% Train). Generation: 25s, Training: 35s. Estimated remaining time: 44h 1m 13s. Estimated total time: 51h 20m 20s. Time estimates for 10 more iterations: 10m 16s, 100 more iterations: 1h 42m 40s, 500 more iterations: 8h 33m 23s. [2025-11-13 05:28:48,550][__main__][INFO] - Starting iteration 477. [2025-11-13 05:28:49,029][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 05:28:49,029][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:29:14,131][__main__][INFO] - Number of regex retries in iteration 477: 0 [2025-11-13 05:29:14,132][__main__][INFO] - agents played in iteration 477 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:29:14,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:29:15,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:29:15,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:29:15,058][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:29:15,058][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:29:15,059][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:29:15,852][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:29:16,312][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:29:16,822][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:29:17,337][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:29:17,835][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:29:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:29:18,841][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:29:19,340][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:29:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:29:20,339][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:29:20,837][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:29:21,340][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:29:21,840][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:29:22,339][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:29:22,840][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:29:23,343][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:29:23,842][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:29:24,341][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:29:24,848][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:29:25,351][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:29:25,850][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:29:26,350][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:29:26,850][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:29:27,351][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:29:27,850][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:29:28,350][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:29:28,853][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:29:29,353][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:29:29,853][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:29:30,355][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:29:30,857][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:29:31,356][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:29:31,858][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:29:32,356][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:29:32,853][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:29:33,353][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:29:33,856][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:29:34,357][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:29:34,858][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:29:35,357][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:29:35,856][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:29:36,355][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:29:36,854][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:29:37,361][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:29:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:29:38,366][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:29:38,864][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:29:39,369][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:29:39,872][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:29:40,376][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:29:40,879][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:29:41,378][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:29:41,881][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:29:42,380][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:29:42,878][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:29:43,378][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:29:43,884][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:29:44,390][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:29:44,896][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:29:45,395][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:29:45,900][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:29:46,411][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:29:46,921][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:29:47,431][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:29:47,945][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10120 tokens. [2025-11-13 05:29:48,806][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:32 [2025-11-13 05:29:49,566][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:29:49,567][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:29:49,569][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:29:50,646][__main__][INFO] - Iteration 478 took 1m 1s (40.74% Gen, 57.51% Train). Generation: 25s, Training: 35s. Estimated remaining time: 44h 0m 44s. Estimated total time: 51h 20m 53s. Time estimates for 10 more iterations: 10m 16s, 100 more iterations: 1h 42m 41s, 500 more iterations: 8h 33m 28s. [2025-11-13 05:29:50,648][__main__][INFO] - Starting iteration 478. [2025-11-13 05:29:51,157][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 05:29:51,158][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:30:14,069][__main__][INFO] - Number of regex retries in iteration 478: 0 [2025-11-13 05:30:14,071][__main__][INFO] - agents played in iteration 478 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:30:15,113][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:30:15,176][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:30:15,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:30:15,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:30:15,247][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:30:15,248][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:30:16,064][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:30:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:30:17,125][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:30:17,630][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:30:18,135][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:30:18,645][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:30:19,154][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:30:19,658][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:30:20,165][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:30:20,671][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:30:21,181][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:30:21,688][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:30:22,199][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:30:22,705][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:30:23,220][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:30:23,722][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:30:24,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:30:24,731][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:30:25,234][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:30:25,739][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:30:26,243][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:30:26,745][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:30:27,249][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:30:27,752][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:30:28,253][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:30:28,761][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:30:29,263][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:30:29,773][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:30:30,277][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:30:30,782][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:30:31,288][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:30:31,791][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:30:32,294][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:30:32,812][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:30:33,314][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:30:33,832][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:30:34,338][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:30:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:30:35,350][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:30:35,853][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:30:36,356][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:30:36,858][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:30:37,366][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:30:37,870][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:30:38,372][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:30:38,874][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:30:39,377][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:30:39,878][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:30:40,385][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:30:40,888][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:30:41,393][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:30:41,907][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:30:42,410][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:30:42,925][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:30:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:30:43,933][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:30:44,447][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:30:44,947][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:30:45,449][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:30:45,947][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:30:46,449][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:30:46,955][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:30:47,458][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:30:47,957][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:30:48,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10123 tokens. [2025-11-13 05:30:49,288][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 05:30:49,953][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:30:49,956][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:30:49,958][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:30:50,891][__main__][INFO] - Iteration 479 took 59s (38.36% Gen, 60.08% Train). Generation: 22s, Training: 35s. Estimated remaining time: 42h 25m 32s. Estimated total time: 49h 46m 41s. Time estimates for 10 more iterations: 9m 57s, 100 more iterations: 1h 39m 33s, 500 more iterations: 8h 17m 46s. [2025-11-13 05:30:50,893][__main__][INFO] - Starting iteration 479. [2025-11-13 05:30:51,410][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 05:30:51,411][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:31:07,015][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:31:16,744][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:31:18,238][__main__][INFO] - Number of regex retries in iteration 479: 2 [2025-11-13 05:31:18,239][__main__][INFO] - agents played in iteration 479 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:31:19,109][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:31:19,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:31:19,162][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:31:19,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:31:19,186][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:31:19,187][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:31:20,001][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:31:20,463][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:31:20,971][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:31:21,475][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:31:21,980][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:31:22,482][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:31:22,983][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:31:23,491][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:31:23,993][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:31:24,496][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:31:25,004][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:31:25,505][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:31:26,010][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:31:26,517][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:31:27,020][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:31:27,522][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:31:28,031][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:31:28,536][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:31:29,050][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:31:29,555][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:31:30,054][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:31:30,556][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:31:31,061][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:31:31,565][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:31:32,067][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:31:32,570][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:31:33,074][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:31:33,578][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:31:34,082][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:31:34,584][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:31:35,088][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:31:35,594][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:31:36,097][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:31:36,605][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:31:37,109][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:31:37,611][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:31:38,113][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:31:38,628][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:31:39,129][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:31:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:31:40,148][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:31:40,674][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:31:41,177][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:31:41,680][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:31:42,186][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:31:42,695][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:31:43,199][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:31:43,703][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:31:44,208][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:31:44,709][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:31:45,211][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:31:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:31:46,216][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:31:46,716][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:31:47,216][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:31:47,717][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:31:48,218][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:31:48,722][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:31:49,222][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:31:49,726][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:31:50,239][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:31:50,746][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:31:51,257][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:31:51,763][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:31:52,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10148 tokens. [2025-11-13 05:31:53,133][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 05:31:53,907][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:31:53,908][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:31:53,910][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:31:54,860][__main__][INFO] - Iteration 480 took 1m 3s (42.28% Gen, 56.22% Train). Generation: 26s, Training: 35s. Estimated remaining time: 45h 30m 18s. Estimated total time: 52h 52m 31s. Time estimates for 10 more iterations: 10m 34s, 100 more iterations: 1h 45m 45s, 500 more iterations: 8h 48m 45s. [2025-11-13 05:31:54,862][__main__][INFO] - Starting iteration 480. [2025-11-13 05:31:55,363][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 47 and human policies 1. 
[2025-11-13 05:31:55,363][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:32:19,899][__main__][INFO] - Number of regex retries in iteration 480: 0 [2025-11-13 05:32:19,901][__main__][INFO] - agents played in iteration 480 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:32:20,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:32:20,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:32:20,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:32:20,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:32:20,806][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:32:20,807][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:32:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:32:22,049][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:32:22,555][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:32:23,073][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:32:23,575][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:32:24,076][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:32:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:32:25,081][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:32:25,588][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:32:26,099][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:32:26,605][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:32:27,114][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:32:27,619][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:32:28,124][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:32:28,641][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:32:29,146][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:32:29,658][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:32:30,162][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:32:30,663][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:32:31,165][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:32:31,667][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:32:32,171][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:32:32,670][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:32:33,171][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:32:33,676][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:32:34,176][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:32:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:32:35,179][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:32:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:32:36,179][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:32:36,680][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:32:37,179][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:32:37,685][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:32:38,190][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:32:38,693][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:32:39,196][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:32:39,697][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:32:40,199][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:32:40,701][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:32:41,203][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:32:41,708][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:32:42,216][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:32:42,725][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:32:43,233][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:32:43,736][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:32:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:32:44,758][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:32:45,261][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:32:45,767][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:32:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:32:46,771][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:32:47,270][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:32:47,777][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:32:48,281][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:32:48,781][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:32:49,286][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:32:49,789][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:32:50,299][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:32:50,808][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:32:51,319][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:32:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:32:52,349][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:32:52,854][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:32:53,372][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:32:53,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10159 tokens. [2025-11-13 05:32:54,757][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 05:32:55,396][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:32:55,398][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:32:55,400][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:32:57,335][__main__][INFO] - Iteration 481 took 1m 1s (39.59% Gen, 57.28% Train). Generation: 24s, Training: 35s. Estimated remaining time: 44h 15m 21s. Estimated total time: 51h 38m 37s. Time estimates for 10 more iterations: 10m 19s, 100 more iterations: 1h 43m 17s, 500 more iterations: 8h 36m 26s. [2025-11-13 05:32:57,337][__main__][INFO] - Starting iteration 481. [2025-11-13 05:32:57,824][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 05:32:57,825][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:33:11,042][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:33:21,649][__main__][INFO] - Number of regex retries in iteration 481: 1 [2025-11-13 05:33:21,650][__main__][INFO] - agents played in iteration 481 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:33:22,512][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:33:22,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:33:22,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:33:22,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:33:22,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:33:22,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:33:23,339][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:33:23,798][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:33:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:33:24,810][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:33:25,321][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:33:25,826][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:33:26,327][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:33:26,831][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:33:27,332][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:33:27,853][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:33:28,355][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:33:28,866][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:33:29,385][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:33:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:33:30,396][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:33:30,902][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:33:31,404][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:33:31,908][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:33:32,414][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:33:32,914][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:33:33,416][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:33:33,920][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:33:34,424][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:33:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:33:35,430][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:33:35,944][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:33:36,446][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:33:36,951][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:33:37,455][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:33:37,956][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:33:38,469][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:33:38,971][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:33:39,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:33:39,981][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:33:40,486][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:33:40,991][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:33:41,497][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:33:41,998][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:33:42,504][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:33:43,006][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:33:43,508][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:33:44,012][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:33:44,537][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:33:45,039][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:33:45,541][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:33:46,043][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:33:46,564][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:33:47,068][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:33:47,573][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:33:48,077][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:33:48,576][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:33:49,075][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:33:49,582][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:33:50,082][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:33:50,583][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:33:51,082][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:33:51,588][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:33:52,094][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:33:52,596][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:33:53,099][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:33:53,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:33:54,101][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:33:54,609][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:33:55,114][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:33:55,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10125 tokens. [2025-11-13 05:33:56,482][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.06%, ΔTime: 00:00:33 [2025-11-13 05:33:57,261][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:33:57,263][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:33:57,266][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:33:58,200][__main__][INFO] - Iteration 482 took 1m 0s (39.46% Gen, 58.99% Train). Generation: 23s, Training: 35s. Estimated remaining time: 42h 54m 33s. Estimated total time: 50h 18m 50s. Time estimates for 10 more iterations: 10m 3s, 100 more iterations: 1h 40m 37s, 500 more iterations: 8h 23m 8s. [2025-11-13 05:33:58,202][__main__][INFO] - Starting iteration 482. [2025-11-13 05:33:58,683][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 05:33:58,684][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:34:22,474][__main__][INFO] - Number of regex retries in iteration 482: 0 [2025-11-13 05:34:22,476][__main__][INFO] - agents played in iteration 482 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:34:23,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:34:23,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:34:23,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:34:23,425][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:34:23,425][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:34:23,426][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:34:24,284][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:34:24,755][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:34:25,265][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:34:25,770][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:34:26,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:34:26,776][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:34:27,283][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:34:27,789][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:34:28,293][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:34:28,794][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:34:29,294][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:34:29,795][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:34:30,295][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:34:30,794][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:34:31,302][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:34:31,802][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:34:32,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:34:32,811][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:34:33,314][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:34:33,820][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:34:34,324][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:34:34,836][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:34:35,345][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:34:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:34:36,358][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:34:36,858][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:34:37,364][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:34:37,890][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:34:38,393][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:34:38,896][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:34:39,402][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:34:39,910][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:34:40,420][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:34:40,928][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:34:41,431][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:34:41,936][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:34:42,440][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:34:42,943][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:34:43,467][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:34:43,975][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:34:44,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:34:44,981][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:34:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:34:45,986][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:34:46,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:34:47,000][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:34:47,502][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:34:48,010][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:34:48,512][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:34:49,012][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:34:49,519][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:34:50,020][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:34:50,524][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:34:51,029][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:34:51,533][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:34:52,038][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:34:52,541][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:34:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:34:53,541][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:34:54,041][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:34:54,545][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:34:55,057][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:34:55,557][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:34:56,075][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:34:56,579][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10099 tokens. [2025-11-13 05:34:57,422][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 05:34:58,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:34:58,053][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:34:58,055][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:34:59,125][__main__][INFO] - Iteration 483 took 1m 0s (39.36% Gen, 58.86% Train). Generation: 23s, Training: 35s. Estimated remaining time: 42h 56m 51s. Estimated total time: 50h 22m 8s. Time estimates for 10 more iterations: 10m 4s, 100 more iterations: 1h 40m 44s, 500 more iterations: 8h 23m 41s. [2025-11-13 05:34:59,127][__main__][INFO] - Starting iteration 483. [2025-11-13 05:34:59,633][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 05:34:59,633][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:35:21,606][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:35:26,371][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:35:27,434][__main__][INFO] - Number of regex retries in iteration 483: 2 [2025-11-13 05:35:27,435][__main__][INFO] - agents played in iteration 483 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:35:28,278][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:35:28,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:35:28,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:35:28,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:35:28,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:35:28,358][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:35:29,208][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:35:29,672][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:35:30,190][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:35:30,698][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:35:31,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:35:31,706][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:35:32,211][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:35:32,719][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:35:33,221][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:35:33,722][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:35:34,238][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:35:34,746][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:35:35,249][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:35:35,756][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:35:36,259][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:35:36,763][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:35:37,266][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:35:37,770][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:35:38,275][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:35:38,779][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:35:39,282][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:35:39,785][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:35:40,293][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:35:40,819][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:35:41,325][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:35:41,832][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:35:42,335][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:35:42,837][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:35:43,342][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:35:43,843][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:35:44,342][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:35:44,845][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:35:45,349][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:35:45,874][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:35:46,379][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:35:46,881][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:35:47,383][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:35:47,889][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:35:48,395][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:35:48,915][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:35:49,419][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:35:49,919][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:35:50,420][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:35:50,921][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:35:51,427][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:35:51,930][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:35:52,431][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:35:52,939][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:35:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:35:53,946][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:35:54,448][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:35:54,950][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:35:55,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:35:55,957][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:35:56,460][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:35:56,963][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:35:57,465][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:35:57,968][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:35:58,469][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:35:58,970][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:35:59,476][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:35:59,979][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:36:00,483][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:36:00,997][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:36:01,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10170 tokens. [2025-11-13 05:36:02,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 05:36:02,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:36:02,994][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:36:02,996][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:36:04,013][__main__][INFO] - Iteration 484 took 1m 4s (43.18% Gen, 55.23% Train). Generation: 27s, Training: 35s. Estimated remaining time: 46h 12m 41s. Estimated total time: 53h 39m 4s. Time estimates for 10 more iterations: 10m 43s, 100 more iterations: 1h 47m 18s, 500 more iterations: 8h 56m 30s. [2025-11-13 05:36:04,015][__main__][INFO] - Starting iteration 484. [2025-11-13 05:36:04,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 05:36:04,516][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:36:18,441][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:36:29,783][__main__][INFO] - Number of regex retries in iteration 484: 1 [2025-11-13 05:36:29,783][__main__][INFO] - agents played in iteration 484 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:36:30,685][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:36:30,721][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:36:30,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:36:30,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:36:30,777][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:36:30,778][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:36:31,630][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:36:32,100][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:36:32,616][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:36:33,125][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:36:33,633][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:36:34,143][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:36:34,650][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:36:35,161][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:36:35,667][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:36:36,171][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:36:36,683][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:36:37,190][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:36:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:36:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:36:38,711][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:36:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:36:39,714][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:36:40,215][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:36:40,717][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:36:41,217][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:36:41,731][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:36:42,232][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:36:42,731][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:36:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:36:43,737][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:36:44,238][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:36:44,740][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:36:45,245][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:36:45,752][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:36:46,252][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:36:46,752][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:36:47,255][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:36:47,757][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:36:48,264][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:36:48,771][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:36:49,272][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:36:49,775][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:36:50,277][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:36:50,783][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:36:51,283][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:36:51,789][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:36:52,292][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:36:52,794][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:36:53,298][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:36:53,803][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:36:54,302][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:36:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:36:55,314][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:36:55,817][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:36:56,319][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:36:56,822][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:36:57,323][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:36:57,827][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:36:58,332][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:36:58,840][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:36:59,343][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:36:59,844][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:37:00,348][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:37:00,850][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:37:01,352][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:37:01,857][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:37:02,362][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:37:02,865][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:37:03,368][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:37:03,867][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10039 tokens. [2025-11-13 05:37:04,743][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.04%, Current % of VRAM taken: 58.29%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 05:37:05,408][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:37:05,410][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:37:05,412][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:37:06,321][__main__][INFO] - Iteration 485 took 1m 1s (40.88% Gen, 57.65% Train). Generation: 25s, Training: 35s. Estimated remaining time: 44h 2m 53s. Estimated total time: 51h 30m 18s. Time estimates for 10 more iterations: 10m 18s, 100 more iterations: 1h 43m 0s, 500 more iterations: 8h 35m 3s. [2025-11-13 05:37:06,323][__main__][INFO] - Starting iteration 485. [2025-11-13 05:37:06,895][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 05:37:06,896][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:37:35,099][__main__][INFO] - Number of regex retries in iteration 485: 0 [2025-11-13 05:37:35,100][__main__][INFO] - agents played in iteration 485 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:37:36,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:37:36,270][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:37:36,293][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:37:36,317][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:37:36,317][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:37:36,318][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:37:37,171][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:37:37,631][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:37:38,143][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:37:38,653][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:37:39,156][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:37:39,667][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:37:40,173][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:37:40,679][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:37:41,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:37:41,692][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:37:42,210][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:37:42,712][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:37:43,220][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:37:43,729][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:37:44,235][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:37:44,741][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:37:45,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:37:45,746][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:37:46,252][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:37:46,755][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:37:47,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:37:47,757][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:37:48,259][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:37:48,759][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:37:49,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:37:49,776][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:37:50,282][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:37:50,786][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:37:51,303][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:37:51,808][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:37:52,314][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:37:52,821][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:37:53,325][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:37:53,832][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:37:54,336][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:37:54,845][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:37:55,351][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:37:55,856][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:37:56,361][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:37:56,861][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:37:57,362][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:37:57,865][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:37:58,366][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:37:58,868][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:37:59,379][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:37:59,883][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:38:00,392][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:38:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:38:01,399][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:38:01,914][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:38:02,415][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:38:02,915][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:38:03,417][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:38:03,917][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:38:04,438][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:38:04,937][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:38:05,437][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:38:05,941][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:38:06,445][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:38:06,950][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:38:07,455][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:38:07,957][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:38:08,466][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:38:08,971][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:38:09,476][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10074 tokens. [2025-11-13 05:38:10,309][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 05:38:11,087][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:38:11,088][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:38:11,090][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:38:12,166][__main__][INFO] - Iteration 486 took 1m 5s (43.21% Gen, 55.14% Train). Generation: 28s, Training: 35s. Estimated remaining time: 46h 55m 3s. Estimated total time: 54h 23m 34s. Time estimates for 10 more iterations: 10m 52s, 100 more iterations: 1h 48m 47s, 500 more iterations: 9h 3m 55s. [2025-11-13 05:38:12,169][__main__][INFO] - Starting iteration 486. [2025-11-13 05:38:12,715][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 05:38:12,715][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:38:43,980][__main__][INFO] - Number of regex retries in iteration 486: 0 [2025-11-13 05:38:43,981][__main__][INFO] - agents played in iteration 486 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:38:44,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:38:44,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:38:44,879][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:38:44,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:38:44,907][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:38:44,907][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:38:45,723][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:38:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:38:46,705][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:38:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:38:47,738][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:38:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:38:48,755][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:38:49,264][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:38:49,768][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:38:50,276][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:38:50,782][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:38:51,291][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:38:51,806][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:38:52,311][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:38:52,829][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:38:53,338][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:38:53,841][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:38:54,354][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:38:54,862][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:38:55,366][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:38:55,876][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:38:56,379][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:38:56,886][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:38:57,389][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:38:57,891][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:38:58,401][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:38:58,904][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:38:59,409][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:38:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:39:00,412][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:39:00,930][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:39:01,434][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:39:01,937][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:39:02,443][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:39:02,946][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:39:03,453][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:39:03,960][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:39:04,467][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:39:04,973][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:39:05,478][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:39:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:39:06,486][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:39:06,989][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:39:07,500][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:39:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:39:08,510][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:39:09,014][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:39:09,519][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:39:10,025][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:39:10,526][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:39:11,038][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:39:11,543][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:39:12,049][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:39:12,564][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:39:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:39:13,567][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:39:14,075][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:39:14,578][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:39:15,082][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:39:15,587][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:39:16,093][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:39:16,607][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:39:17,113][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:39:17,618][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:39:18,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10109 tokens. [2025-11-13 05:39:18,972][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 05:39:19,614][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:39:19,616][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:39:19,618][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:39:20,512][__main__][INFO] - Iteration 487 took 1m 7s (46.12% Gen, 52.56% Train). Generation: 31s, Training: 35s. Estimated remaining time: 49h 0m 13s. Estimated total time: 56h 29m 51s. Time estimates for 10 more iterations: 11m 17s, 100 more iterations: 1h 52m 59s, 500 more iterations: 9h 24m 58s. [2025-11-13 05:39:20,514][__main__][INFO] - Starting iteration 487. [2025-11-13 05:39:20,999][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 05:39:21,000][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:39:42,516][__main__][INFO] - Number of regex retries in iteration 487: 0 [2025-11-13 05:39:42,516][__main__][INFO] - agents played in iteration 487 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:39:43,350][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:39:43,378][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:39:43,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:39:43,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:39:43,430][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:39:43,431][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:39:44,256][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:39:44,722][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:39:45,234][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:39:45,738][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:39:46,240][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:39:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:39:47,246][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:39:47,750][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:39:48,259][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:39:48,768][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:39:49,273][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:39:49,781][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:39:50,291][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:39:50,802][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:39:51,311][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:39:51,822][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:39:52,334][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:39:52,900][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:39:54,615][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:39:55,118][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:39:55,623][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:39:56,144][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:39:56,648][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:39:57,156][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:39:57,661][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:39:58,169][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:39:58,676][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:39:59,189][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:39:59,698][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:40:00,202][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:40:00,714][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:40:01,235][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:40:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:40:02,246][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:40:02,760][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:40:03,268][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:40:03,777][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:40:04,281][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:40:04,795][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:40:05,301][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:40:05,805][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:40:06,315][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:40:06,833][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:40:07,337][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:40:07,861][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:40:08,365][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:40:08,872][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:40:09,381][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:40:09,883][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:40:10,394][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:40:10,903][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:40:11,407][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:40:11,921][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:40:12,430][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:40:12,949][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:40:13,458][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:40:13,961][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:40:14,469][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:40:14,971][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:40:15,473][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:40:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:40:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:40:16,990][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:40:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:40:17,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10140 tokens. [2025-11-13 05:40:18,858][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.98%, Current % of VRAM taken: 58.23%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:34 [2025-11-13 05:40:19,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:40:19,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:40:19,525][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:40:20,482][__main__][INFO] - Iteration 488 took 59s (36.17% Gen, 62.22% Train). Generation: 21s, Training: 37s. Estimated remaining time: 42h 3m 31s. Estimated total time: 49h 34m 10s. Time estimates for 10 more iterations: 9m 54s, 100 more iterations: 1h 39m 8s, 500 more iterations: 8h 15m 41s. [2025-11-13 05:40:20,484][__main__][INFO] - Starting iteration 488. [2025-11-13 05:40:21,017][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 05:40:21,018][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:40:54,507][__main__][INFO] - Number of regex retries in iteration 488: 0 [2025-11-13 05:40:54,507][__main__][INFO] - agents played in iteration 488 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:40:55,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:40:55,358][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:40:55,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:40:55,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:40:55,405][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:40:55,406][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:40:56,216][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:40:56,683][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:40:57,193][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:40:57,702][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:40:58,210][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:40:58,716][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:40:59,224][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:40:59,734][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:41:00,241][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:41:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:41:01,247][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:41:01,764][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:41:02,272][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:41:02,785][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:41:03,288][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:41:03,799][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:41:04,310][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:41:04,813][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:41:05,318][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:41:05,823][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:41:06,328][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:41:06,833][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:41:07,342][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:41:07,848][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:41:08,362][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:41:08,867][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:41:09,381][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:41:09,884][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:41:10,389][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:41:10,894][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:41:11,400][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:41:11,904][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:41:12,408][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:41:12,909][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:41:13,413][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:41:13,916][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:41:14,417][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:41:14,917][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:41:15,419][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:41:15,921][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:41:16,425][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:41:16,927][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:41:17,429][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:41:17,931][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:41:18,434][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:41:18,934][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:41:19,439][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:41:19,941][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:41:20,450][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:41:20,956][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:41:21,465][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:41:21,968][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:41:22,477][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:41:22,977][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:41:23,480][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:41:23,994][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:41:24,499][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:41:25,005][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:41:25,516][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:41:26,020][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:41:26,529][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:41:27,042][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:41:27,548][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:41:28,078][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:41:28,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10107 tokens. [2025-11-13 05:41:29,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 05:41:30,208][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:41:30,209][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:41:30,210][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:41:31,257][__main__][INFO] - Iteration 489 took 1m 10s (47.68% Gen, 50.83% Train). Generation: 33s, Training: 35s. Estimated remaining time: 51h 0m 13s. Estimated total time: 58h 32m 3s. Time estimates for 10 more iterations: 11m 42s, 100 more iterations: 1h 57m 4s, 500 more iterations: 9h 45m 20s. [2025-11-13 05:41:31,260][__main__][INFO] - Starting iteration 489. [2025-11-13 05:41:31,733][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 05:41:31,734][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:41:50,700][__main__][INFO] - Number of regex retries in iteration 489: 0 [2025-11-13 05:41:50,702][__main__][INFO] - agents played in iteration 489 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:41:51,687][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:41:51,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:41:51,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:41:51,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:41:51,808][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:41:51,809][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:41:52,588][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:41:53,050][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:41:53,562][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:41:54,064][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:41:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:41:55,079][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:41:55,590][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:41:56,104][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:41:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:41:57,118][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:41:57,624][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:41:58,134][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:41:58,638][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:41:59,143][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:41:59,649][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:42:00,158][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:42:00,662][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:42:01,166][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:42:01,676][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:42:02,180][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:42:02,695][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:42:03,200][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:42:03,705][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:42:04,211][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:42:04,717][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:42:05,221][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:42:05,726][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:42:06,230][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:42:06,736][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:42:07,243][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:42:07,751][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:42:08,262][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:42:08,766][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:42:09,286][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:42:09,794][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:42:10,298][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:42:10,811][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:42:11,321][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:42:11,829][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:42:12,335][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:42:12,839][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:42:13,344][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:42:13,849][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:42:14,353][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:42:14,857][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:42:15,360][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:42:15,885][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:42:16,392][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:42:16,896][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:42:17,403][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:42:17,909][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:42:18,418][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:42:18,926][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:42:19,430][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:42:19,941][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:42:20,450][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:42:20,954][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:42:21,458][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:42:21,961][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:42:22,469][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:42:22,972][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:42:23,477][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:42:23,980][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:42:24,484][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:42:25,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10105 tokens. [2025-11-13 05:42:25,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 05:42:26,469][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:42:26,471][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:42:26,472][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:42:27,398][__main__][INFO] - Iteration 490 took 55s (34.07% Gen, 64.26% Train). Generation: 18s, Training: 35s. Estimated remaining time: 38h 50m 30s. Estimated total time: 46h 23m 15s. Time estimates for 10 more iterations: 9m 16s, 100 more iterations: 1h 32m 46s, 500 more iterations: 7h 43m 52s. [2025-11-13 05:42:27,400][__main__][INFO] - Starting iteration 490. [2025-11-13 05:42:27,879][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 48 and human policies 1. 
[2025-11-13 05:42:27,880][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:42:56,207][__main__][INFO] - Number of regex retries in iteration 490: 0 [2025-11-13 05:42:56,208][__main__][INFO] - agents played in iteration 490 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:42:57,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:42:57,068][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:42:57,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:42:57,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:42:57,120][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:42:57,121][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:42:57,911][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:42:58,375][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:42:58,885][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:42:59,388][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:42:59,893][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:43:00,412][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:43:00,915][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:43:01,422][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:43:01,938][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:43:02,439][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:43:02,942][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:43:03,444][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:43:03,947][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:43:04,453][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:43:04,956][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:43:05,461][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:43:05,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:43:06,466][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:43:06,975][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:43:07,480][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:43:07,987][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:43:08,516][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:43:09,016][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:43:09,522][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:43:10,029][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:43:10,537][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:43:11,045][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:43:11,551][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:43:12,059][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:43:12,563][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:43:13,068][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:43:13,578][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:43:14,089][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:43:14,601][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:43:15,113][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:43:15,620][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:43:16,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:43:16,645][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:43:17,154][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:43:17,659][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:43:18,164][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:43:18,670][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:43:19,176][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:43:19,686][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:43:20,197][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:43:20,702][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:43:21,210][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:43:21,718][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:43:22,223][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:43:22,735][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:43:23,240][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:43:23,744][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:43:24,250][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:43:24,760][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:43:25,264][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:43:25,768][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:43:26,271][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:43:26,776][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:43:27,277][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:43:27,778][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:43:28,282][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:43:28,789][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:43:29,289][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:43:29,793][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:43:30,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10083 tokens. [2025-11-13 05:43:31,072][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 05:43:31,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:43:31,816][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:43:31,817][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:43:33,820][__main__][INFO] - Iteration 491 took 1m 5s (42.96% Gen, 54.00% Train). Generation: 28s, Training: 35s. Estimated remaining time: 47h 23m 12s. Estimated total time: 54h 57m 4s. Time estimates for 10 more iterations: 10m 59s, 100 more iterations: 1h 49m 54s, 500 more iterations: 9h 9m 30s. [2025-11-13 05:43:33,822][__main__][INFO] - Starting iteration 491. [2025-11-13 05:43:34,292][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 05:43:34,292][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:43:45,749][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:43:58,228][__main__][INFO] - Number of regex retries in iteration 491: 1 [2025-11-13 05:43:58,230][__main__][INFO] - agents played in iteration 491 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:43:59,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:43:59,168][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:43:59,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:43:59,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:43:59,247][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:43:59,247][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:44:00,046][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:44:00,511][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:44:01,031][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:44:01,536][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:44:02,039][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:44:02,544][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:44:03,051][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:44:03,557][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:44:04,065][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:44:04,570][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:44:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:44:05,573][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:44:06,080][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:44:06,582][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:44:07,085][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:44:07,587][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:44:08,089][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:44:08,592][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:44:09,106][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:44:09,611][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:44:10,115][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:44:10,614][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:44:11,114][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:44:11,627][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:44:12,133][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:44:12,638][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:44:13,145][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:44:13,654][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:44:14,161][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:44:14,664][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:44:15,171][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:44:15,678][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:44:16,184][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:44:16,687][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:44:17,193][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:44:17,698][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:44:18,204][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:44:18,711][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:44:19,215][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:44:19,730][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:44:20,236][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:44:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:44:21,251][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:44:21,754][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:44:22,264][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:44:22,775][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:44:23,282][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:44:23,787][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:44:24,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:44:24,800][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:44:25,303][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:44:25,807][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:44:26,318][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:44:26,823][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:44:27,346][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:44:27,850][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:44:28,357][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:44:28,865][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:44:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:44:29,874][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:44:30,380][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:44:30,884][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:44:31,389][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:44:31,894][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:44:32,401][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10173 tokens. [2025-11-13 05:44:33,281][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 05:44:33,948][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:44:33,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:44:33,952][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:44:34,868][__main__][INFO] - Iteration 492 took 1m 0s (39.52% Gen, 58.97% Train). Generation: 23s, Training: 35s. Estimated remaining time: 42h 53m 57s. Estimated total time: 50h 28m 50s. Time estimates for 10 more iterations: 10m 5s, 100 more iterations: 1h 40m 57s, 500 more iterations: 8h 24m 48s. [2025-11-13 05:44:34,871][__main__][INFO] - Starting iteration 492. [2025-11-13 05:44:35,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 05:44:35,355][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:45:05,615][__main__][INFO] - Number of regex retries in iteration 492: 0 [2025-11-13 05:45:05,616][__main__][INFO] - agents played in iteration 492 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:45:06,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:45:06,434][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:45:06,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:45:06,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:45:06,481][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:45:06,482][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:45:07,270][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:45:07,735][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:45:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:45:08,762][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:45:09,272][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:45:09,785][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:45:10,293][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:45:10,799][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:45:11,304][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:45:11,808][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:45:12,310][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:45:12,816][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:45:13,317][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:45:13,819][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:45:14,324][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:45:14,837][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:45:15,338][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:45:15,843][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:45:16,351][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:45:16,852][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:45:17,362][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:45:17,862][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:45:18,367][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:45:18,875][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:45:19,381][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:45:19,882][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:45:20,383][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:45:20,889][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:45:21,396][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:45:21,898][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:45:22,398][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:45:22,901][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:45:23,407][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:45:23,917][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:45:24,423][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:45:24,925][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:45:25,440][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:45:25,947][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:45:26,461][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:45:26,962][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:45:27,466][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:45:27,973][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:45:28,478][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:45:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:45:29,497][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:45:30,008][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:45:30,516][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:45:31,022][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:45:31,528][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:45:32,045][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:45:32,552][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:45:33,066][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:45:33,575][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:45:34,081][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:45:34,588][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:45:35,094][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:45:35,601][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:45:36,105][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:45:36,613][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:45:37,118][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:45:37,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:45:38,127][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:45:38,646][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:45:39,153][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:45:39,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10192 tokens. [2025-11-13 05:45:40,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 05:45:41,239][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:45:41,240][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:45:41,242][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:45:42,301][__main__][INFO] - Iteration 493 took 1m 6s (45.20% Gen, 53.22% Train). Generation: 30s, Training: 35s. Estimated remaining time: 48h 11m 19s. Estimated total time: 55h 47m 20s. Time estimates for 10 more iterations: 11m 9s, 100 more iterations: 1h 51m 34s, 500 more iterations: 9h 17m 53s. [2025-11-13 05:45:42,303][__main__][INFO] - Starting iteration 493. [2025-11-13 05:45:42,774][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 05:45:42,775][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:46:04,559][__main__][INFO] - Number of regex retries in iteration 493: 0 [2025-11-13 05:46:04,560][__main__][INFO] - agents played in iteration 493 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:46:05,472][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:46:05,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:46:05,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:46:05,548][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:46:05,549][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:46:05,549][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:46:06,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:46:06,802][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:46:07,315][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:46:07,818][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:46:08,323][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:46:08,826][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:46:09,329][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:46:09,835][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:46:10,348][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:46:10,849][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:46:11,372][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:46:11,874][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:46:12,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:46:12,880][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:46:13,382][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:46:13,894][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:46:14,402][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:46:14,906][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:46:15,414][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:46:15,921][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:46:16,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:46:16,955][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:46:17,464][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:46:17,985][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:46:18,490][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:46:18,998][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:46:19,498][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:46:19,999][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:46:20,501][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:46:21,006][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:46:21,506][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:46:22,013][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:46:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:46:23,012][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:46:23,514][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:46:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:46:24,516][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:46:25,017][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:46:25,516][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:46:26,016][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:46:26,514][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:46:27,016][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:46:27,518][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:46:28,020][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:46:28,527][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:46:29,032][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:46:29,537][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:46:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:46:30,538][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:46:31,044][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:46:31,547][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:46:32,047][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:46:32,549][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:46:33,051][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:46:33,577][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:46:34,082][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:46:34,585][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:46:35,099][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:46:35,603][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:46:36,110][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:46:36,615][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:46:37,119][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:46:37,626][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:46:38,133][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:46:38,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10091 tokens. [2025-11-13 05:46:39,487][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:33 [2025-11-13 05:46:40,151][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:46:40,153][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:46:40,155][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:46:41,061][__main__][INFO] - Iteration 494 took 58s (37.37% Gen, 61.07% Train). Generation: 21s, Training: 35s. Estimated remaining time: 40h 57m 25s. Estimated total time: 48h 34m 25s. Time estimates for 10 more iterations: 9m 42s, 100 more iterations: 1h 37m 8s, 500 more iterations: 8h 5m 44s. [2025-11-13 05:46:41,063][__main__][INFO] - Starting iteration 494. [2025-11-13 05:46:41,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 05:46:41,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:47:09,507][__main__][INFO] - Number of regex retries in iteration 494: 0 [2025-11-13 05:47:09,508][__main__][INFO] - agents played in iteration 494 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:47:10,349][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:47:10,371][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:47:10,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:47:10,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:47:10,416][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:47:10,417][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:47:11,201][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:47:11,662][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:47:12,176][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:47:12,685][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:47:13,192][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:47:13,699][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:47:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:47:14,720][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:47:15,226][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:47:15,729][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:47:16,231][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:47:16,737][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:47:17,244][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:47:17,751][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:47:18,259][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:47:18,769][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:47:19,273][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:47:19,778][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:47:20,282][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:47:20,804][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:47:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:47:21,818][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:47:22,319][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:47:22,823][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:47:23,327][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:47:23,829][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:47:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:47:24,836][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:47:25,339][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:47:25,844][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:47:26,351][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:47:26,853][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:47:27,356][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:47:27,856][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:47:28,358][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:47:28,860][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:47:29,363][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:47:29,873][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:47:30,376][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:47:30,877][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:47:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:47:31,888][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:47:32,401][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:47:32,903][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:47:33,406][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:47:33,918][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:47:34,419][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:47:34,921][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:47:35,429][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:47:35,932][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:47:36,436][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:47:36,940][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:47:37,445][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:47:37,953][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:47:38,459][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:47:38,969][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:47:39,475][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:47:39,981][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:47:40,505][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:47:41,010][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:47:41,516][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:47:42,020][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:47:42,528][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:47:43,032][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:47:43,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10124 tokens. [2025-11-13 05:47:44,411][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 05:47:45,155][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:47:45,157][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:47:45,159][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:47:46,228][__main__][INFO] - Iteration 495 took 1m 4s (43.20% Gen, 55.14% Train). Generation: 27s, Training: 35s. Estimated remaining time: 46h 14m 35s. Estimated total time: 53h 52m 39s. Time estimates for 10 more iterations: 10m 46s, 100 more iterations: 1h 47m 45s, 500 more iterations: 8h 58m 46s. [2025-11-13 05:47:46,231][__main__][INFO] - Starting iteration 495. [2025-11-13 05:47:46,708][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 05:47:46,709][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:48:11,344][__main__][INFO] - Number of regex retries in iteration 495: 0 [2025-11-13 05:48:11,345][__main__][INFO] - agents played in iteration 495 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:48:12,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:48:12,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:48:12,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:48:12,264][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:48:12,265][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:48:12,266][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:48:13,097][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:48:13,570][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:48:14,083][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:48:14,601][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:48:15,105][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:48:15,614][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:48:16,122][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:48:16,631][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:48:17,135][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:48:17,640][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:48:18,149][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:48:18,657][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:48:19,163][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:48:19,668][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:48:20,183][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:48:20,686][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:48:21,202][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:48:21,710][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:48:22,217][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:48:22,728][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:48:23,238][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:48:23,741][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:48:24,248][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:48:24,756][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:48:25,264][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:48:25,767][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:48:26,270][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:48:26,784][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:48:27,287][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:48:27,799][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:48:28,300][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:48:28,802][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:48:29,305][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:48:29,814][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:48:30,317][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:48:30,818][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:48:31,320][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:48:31,827][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:48:32,331][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:48:32,835][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:48:33,340][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:48:33,843][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:48:34,344][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:48:34,850][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:48:35,350][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:48:35,853][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:48:36,354][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:48:36,853][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:48:37,353][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:48:37,858][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:48:38,364][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:48:38,862][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:48:39,361][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:48:39,874][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:48:40,374][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:48:40,879][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:48:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:48:41,886][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:48:42,409][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:48:42,915][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:48:43,419][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:48:43,923][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:48:44,426][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:48:44,931][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:48:45,436][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10006 tokens. [2025-11-13 05:48:46,296][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.14%, ΔTime: 00:00:33 [2025-11-13 05:48:46,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:48:46,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:48:46,938][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:48:47,862][__main__][INFO] - Iteration 496 took 1m 1s (40.29% Gen, 58.20% Train). Generation: 24s, Training: 35s. Estimated remaining time: 43h 18m 35s. Estimated total time: 50h 57m 41s. Time estimates for 10 more iterations: 10m 11s, 100 more iterations: 1h 41m 55s, 500 more iterations: 8h 29m 36s. [2025-11-13 05:48:47,865][__main__][INFO] - Starting iteration 496. [2025-11-13 05:48:48,336][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 05:48:48,336][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:49:20,105][__main__][INFO] - Number of regex retries in iteration 496: 0 [2025-11-13 05:49:20,106][__main__][INFO] - agents played in iteration 496 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:49:20,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:49:20,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:49:21,003][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:49:21,027][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:49:21,027][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:49:21,028][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:49:21,864][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:49:22,327][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:49:22,836][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:49:23,341][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:49:23,849][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:49:24,354][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:49:24,859][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:49:25,363][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:49:25,877][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:49:26,388][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:49:26,905][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:49:27,413][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:49:27,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:49:28,424][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:49:28,934][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:49:29,438][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:49:29,941][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:49:30,452][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:49:30,960][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:49:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:49:31,966][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:49:32,480][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:49:32,987][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:49:33,501][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:49:34,001][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:49:34,505][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:49:35,020][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:49:35,527][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:49:36,037][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:49:36,548][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:49:37,053][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:49:37,554][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:49:38,059][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:49:38,558][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:49:39,059][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:49:39,562][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:49:40,062][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:49:40,565][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:49:41,064][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:49:41,565][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:49:42,065][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:49:42,570][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:49:43,074][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:49:43,575][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:49:44,086][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:49:44,589][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:49:45,091][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:49:45,598][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:49:46,098][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:49:46,607][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:49:47,114][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:49:47,616][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:49:48,126][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:49:48,626][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:49:49,126][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:49:49,629][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:49:50,132][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:49:50,640][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:49:51,142][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:49:51,646][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:49:52,155][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:49:52,664][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:49:53,173][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:49:53,677][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:49:54,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10225 tokens. [2025-11-13 05:49:55,055][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 05:49:55,824][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:49:55,826][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:49:55,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:49:56,847][__main__][INFO] - Iteration 497 took 1m 8s (46.37% Gen, 52.14% Train). Generation: 31s, Training: 35s. Estimated remaining time: 49h 25m 22s. Estimated total time: 57h 5m 37s. Time estimates for 10 more iterations: 11m 25s, 100 more iterations: 1h 54m 11s, 500 more iterations: 9h 30m 56s. [2025-11-13 05:49:56,850][__main__][INFO] - Starting iteration 497. [2025-11-13 05:49:57,339][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 05:49:57,339][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:50:28,171][__main__][INFO] - Number of regex retries in iteration 497: 0 [2025-11-13 05:50:28,172][__main__][INFO] - agents played in iteration 497 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:50:29,031][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:50:29,057][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:50:29,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:50:29,104][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:50:29,105][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:50:29,106][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:50:29,934][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:50:30,397][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:50:30,903][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:50:31,419][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:50:31,926][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:50:32,430][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:50:32,932][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:50:33,441][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:50:33,944][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:50:34,448][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:50:34,952][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:50:35,464][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:50:35,973][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:50:36,477][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:50:36,982][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:50:37,490][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:50:38,008][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:50:38,512][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:50:39,032][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:50:39,537][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:50:40,040][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:50:40,546][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:50:41,048][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:50:41,554][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:50:42,064][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:50:42,568][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:50:43,071][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:50:43,576][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:50:44,078][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:50:44,585][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:50:45,088][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:50:45,591][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:50:46,094][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:50:46,599][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:50:47,115][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:50:47,617][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:50:48,118][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:50:48,628][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:50:49,132][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:50:49,638][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:50:50,142][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:50:50,670][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:50:51,177][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:50:51,679][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:50:52,182][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:50:52,692][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:50:53,199][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:50:53,710][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:50:54,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:50:54,717][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:50:55,224][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:50:55,725][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:50:56,233][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:50:56,734][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:50:57,241][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:50:57,747][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:50:58,256][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:50:58,764][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:50:59,273][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:50:59,779][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:51:00,287][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:51:00,797][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:51:01,309][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:51:01,822][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:51:02,328][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10215 tokens. [2025-11-13 05:51:03,171][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 05:51:03,807][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:51:03,808][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:51:03,811][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:51:04,767][__main__][INFO] - Iteration 498 took 1m 7s (45.73% Gen, 52.85% Train). Generation: 30s, Training: 35s. Estimated remaining time: 48h 30m 4s. Estimated total time: 56h 11m 27s. Time estimates for 10 more iterations: 11m 14s, 100 more iterations: 1h 52m 22s, 500 more iterations: 9h 21m 54s. [2025-11-13 05:51:04,769][__main__][INFO] - Starting iteration 498. [2025-11-13 05:51:05,259][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 05:51:05,260][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:51:21,276][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:51:23,557][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:51:33,441][__main__][INFO] - Number of regex retries in iteration 498: 2 [2025-11-13 05:51:33,441][__main__][INFO] - agents played in iteration 498 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:51:34,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:51:34,308][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:51:34,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:51:34,356][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:51:34,357][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:51:34,357][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:51:35,132][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:51:35,770][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:51:36,275][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:51:36,781][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:51:37,292][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:51:37,801][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:51:38,305][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:51:38,810][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:51:39,313][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:51:39,837][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:51:40,345][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:51:40,856][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:51:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:51:41,865][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:51:42,373][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:51:42,879][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:51:43,385][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:51:43,893][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:51:44,396][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:51:44,909][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:51:45,414][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:51:45,918][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:51:46,422][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:51:46,927][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:51:47,434][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:51:47,938][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:51:48,439][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:51:48,951][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:51:49,453][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:51:49,958][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:51:50,458][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:51:50,960][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:51:51,460][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:51:51,963][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:51:52,468][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:51:52,975][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:51:53,478][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:51:53,982][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:51:54,489][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:51:54,992][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:51:55,496][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:51:55,996][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:51:56,496][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:51:57,001][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:51:57,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:51:58,005][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:51:58,510][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:51:59,013][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:51:59,538][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:52:00,042][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:52:00,546][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:52:01,052][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:52:01,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:52:02,068][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:52:02,572][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:52:03,079][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:52:03,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:52:04,088][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:52:04,594][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:52:05,103][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:52:05,611][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:52:06,139][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:52:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:52:07,145][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:52:07,653][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10209 tokens. [2025-11-13 05:52:08,503][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.42%, ΔTime: 00:00:33 [2025-11-13 05:52:09,245][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:52:09,247][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:52:09,248][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:52:10,251][__main__][INFO] - Iteration 499 took 1m 4s (43.36% Gen, 55.09% Train). Generation: 28s, Training: 35s. Estimated remaining time: 46h 27m 8s. Estimated total time: 54h 9m 37s. Time estimates for 10 more iterations: 10m 49s, 100 more iterations: 1h 48m 19s, 500 more iterations: 9h 1m 36s. [2025-11-13 05:52:10,254][__main__][INFO] - Starting iteration 499. [2025-11-13 05:52:10,728][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 05:52:10,730][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:52:40,288][__main__][INFO] - Number of regex retries in iteration 499: 0 [2025-11-13 05:52:40,289][__main__][INFO] - agents played in iteration 499 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:52:41,156][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:52:41,184][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:52:41,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:52:41,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:52:41,234][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:52:41,235][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:52:42,048][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:52:42,513][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:52:43,035][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:52:43,544][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:52:44,053][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:52:44,559][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:52:45,065][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:52:45,572][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:52:46,076][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:52:46,580][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:52:47,087][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:52:47,590][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:52:48,095][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:52:48,598][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:52:49,102][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:52:49,613][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:52:50,118][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:52:50,628][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:52:51,133][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:52:51,647][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:52:52,158][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:52:52,661][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:52:53,169][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:52:53,671][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:52:54,176][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:52:54,684][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:52:55,191][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:52:55,693][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:52:56,200][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:52:56,704][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:52:57,209][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:52:57,721][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:52:58,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:52:58,741][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:52:59,242][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:52:59,743][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:53:00,246][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:53:00,748][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:53:01,257][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:53:01,757][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:53:02,262][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:53:02,767][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:53:03,268][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:53:03,770][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:53:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:53:04,779][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:53:05,288][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:53:05,792][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:53:06,299][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:53:06,805][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:53:07,313][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:53:07,826][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:53:08,331][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:53:08,838][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:53:09,350][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:53:09,857][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:53:10,368][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:53:10,875][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:53:11,380][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:53:11,886][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:53:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:53:12,903][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:53:13,410][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:53:13,915][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:53:14,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10072 tokens. [2025-11-13 05:53:15,282][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33 [2025-11-13 05:53:15,926][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:53:15,928][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:53:15,930][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:53:16,915][__main__][INFO] - Iteration 500 took 1m 6s (44.66% Gen, 53.85% Train). Generation: 29s, Training: 35s. Estimated remaining time: 47h 25m 48s. Estimated total time: 55h 9m 23s. Time estimates for 10 more iterations: 11m 1s, 100 more iterations: 1h 50m 18s, 500 more iterations: 9h 11m 33s. [2025-11-13 05:53:16,918][__main__][INFO] - Starting iteration 500. [2025-11-13 05:53:17,398][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 49 and human policies 1. 
[2025-11-13 05:53:17,399][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:53:44,505][__main__][INFO] - Number of regex retries in iteration 500: 0 [2025-11-13 05:53:44,507][__main__][INFO] - agents played in iteration 500 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:53:45,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:53:45,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:53:45,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:53:45,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:53:45,467][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:53:45,467][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:53:46,286][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:53:46,751][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:53:47,266][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:53:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:53:48,281][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:53:48,787][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:53:49,294][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:53:49,800][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:53:50,303][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:53:50,807][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:53:51,309][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:53:51,819][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:53:52,334][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:53:52,842][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:53:53,362][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:53:53,868][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:53:54,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:53:54,886][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:53:55,392][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:53:55,898][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:53:56,405][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:53:56,910][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:53:57,418][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:53:57,926][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:53:58,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:53:58,941][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:53:59,443][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:53:59,957][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:54:00,457][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:54:00,960][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:54:01,462][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:54:01,969][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:54:02,475][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:54:02,981][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:54:03,485][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:54:03,988][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:54:04,493][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:54:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:54:05,495][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:54:05,996][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:54:06,501][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:54:07,007][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:54:07,510][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:54:08,014][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:54:08,515][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:54:09,016][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:54:09,519][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:54:10,021][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:54:10,529][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:54:11,029][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:54:11,528][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:54:12,039][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:54:12,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:54:13,058][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:54:13,558][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:54:14,060][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:54:14,571][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:54:15,076][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:54:15,583][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:54:16,087][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:54:16,599][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:54:17,104][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:54:17,611][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:54:18,116][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:54:18,628][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10139 tokens. [2025-11-13 05:54:19,493][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:33 [2025-11-13 05:54:20,166][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:54:20,168][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:54:20,170][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:54:22,037][__main__][INFO] - Iteration 501 took 1m 4s (41.93% Gen, 55.17% Train). Generation: 27s, Training: 35s. Estimated remaining time: 46h 7m 17s. Estimated total time: 53h 51m 58s. Time estimates for 10 more iterations: 10m 46s, 100 more iterations: 1h 47m 43s, 500 more iterations: 8h 58m 39s. [2025-11-13 05:54:22,039][__main__][INFO] - Starting iteration 501. [2025-11-13 05:54:22,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 05:54:22,517][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:54:50,011][__main__][INFO] - Number of regex retries in iteration 501: 0 [2025-11-13 05:54:50,012][__main__][INFO] - agents played in iteration 501 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:54:50,816][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:54:50,839][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:54:50,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:54:50,885][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:54:50,886][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:54:50,887][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:54:51,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:54:52,218][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:54:52,727][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:54:53,230][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:54:53,736][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:54:54,249][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:54:54,755][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:54:55,260][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:54:55,775][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:54:56,278][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:54:56,783][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:54:57,287][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:54:57,791][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:54:58,297][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:54:58,800][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:54:59,304][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:54:59,812][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:55:00,322][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:55:00,823][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:55:01,324][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:55:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:55:02,348][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:55:02,849][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:55:03,366][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:55:03,872][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:55:04,375][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:55:04,882][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:55:05,387][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:55:05,890][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:55:06,392][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:55:06,894][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:55:07,400][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:55:07,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:55:08,404][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:55:08,908][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:55:09,411][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:55:09,913][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:55:10,420][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:55:10,925][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:55:11,428][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:55:11,930][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:55:12,435][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:55:12,937][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:55:13,440][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:55:13,957][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:55:14,459][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:55:14,962][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:55:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:55:15,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:55:16,480][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:55:16,984][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:55:17,490][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:55:18,000][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:55:18,503][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:55:19,010][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:55:19,515][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:55:20,021][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:55:20,527][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:55:21,032][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:55:21,536][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:55:22,042][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:55:22,548][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:55:23,080][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:55:23,590][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:55:24,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10069 tokens. [2025-11-13 05:55:24,963][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.14%, ΔTime: 00:00:33 [2025-11-13 05:55:25,746][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:55:25,748][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:55:25,750][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:55:26,901][__main__][INFO] - Iteration 502 took 1m 4s (42.70% Gen, 55.51% Train). Generation: 27s, Training: 35s. Estimated remaining time: 45h 53m 32s. Estimated total time: 53h 39m 17s. Time estimates for 10 more iterations: 10m 43s, 100 more iterations: 1h 47m 18s, 500 more iterations: 8h 56m 32s. [2025-11-13 05:55:26,904][__main__][INFO] - Starting iteration 502. [2025-11-13 05:55:27,374][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 05:55:27,375][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:55:51,603][__main__][INFO] - Number of regex retries in iteration 502: 0 [2025-11-13 05:55:51,606][__main__][INFO] - agents played in iteration 502 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:55:52,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:55:52,536][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:55:52,565][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:55:52,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:55:52,590][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:55:52,591][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:55:53,452][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:55:54,007][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:55:54,531][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:55:55,035][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:55:55,539][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:55:56,056][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:55:56,563][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:55:57,080][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:55:57,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:55:58,098][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:55:58,608][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:55:59,119][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:55:59,622][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:56:00,126][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:56:00,631][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:56:01,141][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:56:01,645][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:56:02,153][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:56:02,667][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:56:03,174][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:56:03,686][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:56:04,194][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:56:04,723][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:56:05,231][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:56:05,738][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:56:06,245][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:56:06,749][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:56:07,255][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:56:07,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:56:08,287][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:56:08,792][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:56:09,299][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:56:09,803][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:56:10,310][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:56:10,818][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:56:11,321][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:56:11,826][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:56:12,331][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:56:12,832][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:56:13,334][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:56:13,837][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:56:14,339][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:56:14,839][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:56:15,339][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:56:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:56:16,339][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:56:16,841][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:56:17,341][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:56:17,846][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:56:18,362][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:56:18,863][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:56:19,371][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:56:19,873][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:56:20,376][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:56:20,880][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:56:21,381][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:56:21,884][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:56:22,385][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:56:22,888][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:56:23,403][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:56:23,905][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:56:24,406][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:56:24,914][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:56:25,416][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:56:25,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10137 tokens. [2025-11-13 05:56:26,750][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.45%, ΔTime: 00:00:33 [2025-11-13 05:56:27,411][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:56:27,413][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:56:27,415][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:56:28,338][__main__][INFO] - Iteration 503 took 1m 0s (39.74% Gen, 58.74% Train). Generation: 24s, Training: 35s. Estimated remaining time: 43h 1m 27s. Estimated total time: 50h 48m 13s. Time estimates for 10 more iterations: 10m 9s, 100 more iterations: 1h 41m 36s, 500 more iterations: 8h 28m 2s. [2025-11-13 05:56:28,341][__main__][INFO] - Starting iteration 503. [2025-11-13 05:56:28,807][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 05:56:28,808][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:56:54,902][__main__][INFO] - Number of regex retries in iteration 503: 0 [2025-11-13 05:56:54,902][__main__][INFO] - agents played in iteration 503 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:56:55,723][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:56:55,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:56:55,769][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:56:55,791][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:56:55,792][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:56:55,794][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:56:56,639][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:56:57,108][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:56:57,620][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:56:58,132][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:56:58,642][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:56:59,150][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:56:59,666][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:57:00,176][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:57:00,683][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:57:01,188][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:57:01,695][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:57:02,206][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:57:02,709][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:57:03,214][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:57:03,723][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:57:04,222][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:57:04,724][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:57:05,230][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:57:05,738][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:57:06,245][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:57:06,745][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:57:07,247][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:57:07,758][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:57:08,261][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:57:08,772][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:57:09,274][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:57:09,775][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:57:10,279][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:57:10,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:57:11,284][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:57:11,786][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:57:12,292][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:57:12,803][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:57:13,308][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:57:13,815][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:57:14,319][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:57:14,822][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:57:15,328][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:57:15,832][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:57:16,338][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:57:16,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:57:17,349][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:57:17,849][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:57:18,351][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:57:18,852][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:57:19,367][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:57:19,867][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:57:20,368][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:57:20,877][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:57:21,380][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:57:21,889][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:57:22,393][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:57:22,893][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:57:23,396][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:57:23,897][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:57:24,397][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:57:24,902][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:57:25,402][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:57:25,911][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:57:26,412][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:57:26,912][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:57:27,412][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:57:27,912][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:57:28,416][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:57:28,917][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10121 tokens. [2025-11-13 05:57:29,688][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 05:57:30,414][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:57:30,417][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:57:30,418][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:57:31,454][__main__][INFO] - Iteration 504 took 1m 2s (41.65% Gen, 56.69% Train). Generation: 26s, Training: 35s. Estimated remaining time: 44h 24m 30s. Estimated total time: 52h 12m 20s. Time estimates for 10 more iterations: 10m 26s, 100 more iterations: 1h 44m 24s, 500 more iterations: 8h 42m 3s. [2025-11-13 05:57:31,456][__main__][INFO] - Starting iteration 504. [2025-11-13 05:57:31,925][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 05:57:31,925][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:58:00,092][__main__][INFO] - Number of regex retries in iteration 504: 0 [2025-11-13 05:58:00,094][__main__][INFO] - agents played in iteration 504 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:58:00,937][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:58:00,960][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:58:00,985][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:58:01,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:58:01,010][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:58:01,011][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:58:01,871][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:58:02,336][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:58:02,847][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:58:03,359][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:58:03,863][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:58:04,368][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:58:04,875][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:58:05,398][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:58:05,901][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:58:06,404][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:58:06,910][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:58:07,413][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:58:07,916][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:58:08,419][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:58:08,921][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:58:09,425][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:58:09,929][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:58:10,436][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:58:10,941][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:58:11,449][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:58:11,957][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:58:12,464][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:58:12,968][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:58:13,490][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:58:13,995][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:58:14,507][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:58:15,008][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:58:15,510][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:58:16,015][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:58:16,521][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:58:17,023][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:58:17,525][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:58:18,026][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:58:18,537][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:58:19,045][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:58:19,552][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:58:20,056][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:58:20,561][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:58:21,064][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:58:21,570][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:58:22,077][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:58:22,583][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:58:23,084][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:58:23,593][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:58:24,099][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:58:24,608][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:58:25,119][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:58:25,626][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:58:26,133][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:58:26,642][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:58:27,147][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:58:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:58:28,160][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:58:28,660][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:58:29,163][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:58:29,663][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:58:30,166][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:58:30,672][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:58:31,175][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:58:31,703][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:58:32,204][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:58:32,706][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:58:33,210][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:58:33,717][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:58:34,226][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10135 tokens. [2025-11-13 05:58:35,042][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:33 [2025-11-13 05:58:35,690][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:58:35,692][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:58:35,694][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:58:36,578][__main__][INFO] - Iteration 505 took 1m 4s (43.57% Gen, 55.06% Train). Generation: 28s, Training: 35s. Estimated remaining time: 46h 3m 46s. Estimated total time: 53h 52m 41s. Time estimates for 10 more iterations: 10m 46s, 100 more iterations: 1h 47m 45s, 500 more iterations: 8h 58m 46s. [2025-11-13 05:58:36,581][__main__][INFO] - Starting iteration 505. [2025-11-13 05:58:37,065][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 05:58:37,066][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:58:56,759][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:59:06,963][__main__][INFO] - Number of regex retries in iteration 505: 1 [2025-11-13 05:59:06,965][__main__][INFO] - agents played in iteration 505 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 05:59:07,779][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:59:07,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:59:07,826][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:59:07,849][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 05:59:07,850][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 05:59:07,851][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 05:59:08,716][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 05:59:09,178][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 05:59:09,686][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 05:59:10,201][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 05:59:10,708][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 05:59:11,218][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 05:59:11,722][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 05:59:12,230][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 05:59:12,736][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 05:59:13,245][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 05:59:13,756][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 05:59:14,261][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 05:59:14,769][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 05:59:15,276][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 05:59:15,780][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 05:59:16,286][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 05:59:16,793][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 05:59:17,304][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 05:59:17,815][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 05:59:18,319][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 05:59:18,822][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
05:59:19,327][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 05:59:19,830][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 05:59:20,335][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 05:59:20,842][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 05:59:21,343][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 05:59:21,845][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 05:59:22,350][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 05:59:22,853][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 05:59:23,358][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 05:59:23,861][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 05:59:24,367][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 05:59:24,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 05:59:25,372][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 05:59:25,890][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 05:59:26,396][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 05:59:26,909][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 05:59:27,416][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 05:59:27,919][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 05:59:28,421][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 05:59:28,923][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 05:59:29,427][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
05:59:29,926][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 05:59:30,430][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 05:59:30,933][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 05:59:31,432][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 05:59:31,936][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 05:59:32,437][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 05:59:32,937][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 05:59:33,438][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 05:59:33,937][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 05:59:34,435][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 05:59:34,938][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 05:59:35,436][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 05:59:35,937][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 05:59:36,436][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 05:59:36,935][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 05:59:37,439][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 05:59:37,938][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 05:59:38,440][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 05:59:38,939][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 05:59:39,439][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 05:59:39,938][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
05:59:40,439][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 05:59:40,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10122 tokens. [2025-11-13 05:59:41,716][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33 [2025-11-13 05:59:42,457][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 05:59:42,458][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 05:59:42,460][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 05:59:43,441][__main__][INFO] - Iteration 506 took 1m 6s (45.04% Gen, 53.47% Train). Generation: 29s, Training: 35s. Estimated remaining time: 47h 28m 48s. Estimated total time: 55h 18m 50s. Time estimates for 10 more iterations: 11m 3s, 100 more iterations: 1h 50m 37s, 500 more iterations: 9h 13m 8s. [2025-11-13 05:59:43,444][__main__][INFO] - Starting iteration 506. [2025-11-13 05:59:43,924][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 05:59:43,925][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 05:59:59,331][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 05:59:59,336][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:00:11,230][__main__][INFO] - Number of regex retries in iteration 506: 2 [2025-11-13 06:00:11,231][__main__][INFO] - agents played in iteration 506 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:00:12,118][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:00:12,154][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:00:12,182][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:00:12,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:00:12,206][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:00:12,206][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:00:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:00:13,530][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:00:14,047][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:00:14,555][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:00:15,066][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:00:15,576][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:00:16,085][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:00:16,597][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:00:17,111][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:00:17,624][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:00:18,139][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:00:18,647][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:00:19,153][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:00:19,665][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:00:20,175][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:00:20,685][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:00:21,196][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:00:21,700][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:00:22,227][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:00:22,730][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:00:23,237][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:00:23,744][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:00:24,251][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:00:24,759][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:00:25,264][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:00:25,766][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:00:26,270][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:00:26,773][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:00:27,277][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:00:27,781][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:00:28,283][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:00:28,791][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:00:29,293][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:00:29,800][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:00:30,311][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:00:30,817][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:00:31,329][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:00:31,837][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:00:32,347][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:00:32,855][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:00:33,361][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:00:33,862][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:00:34,368][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:00:34,879][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:00:35,387][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:00:35,893][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:00:36,402][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:00:36,906][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:00:37,411][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:00:37,933][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:00:38,433][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:00:38,936][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:00:39,442][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:00:39,944][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:00:40,443][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:00:40,946][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:00:41,448][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:00:41,954][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:00:42,454][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:00:42,955][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:00:43,457][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:00:43,960][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:00:44,462][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:00:44,966][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:00:45,465][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10316 tokens. [2025-11-13 06:00:46,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.44%, ΔTime: 00:00:33 [2025-11-13 06:00:46,853][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:00:46,854][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:00:46,856][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:00:47,778][__main__][INFO] - Iteration 507 took 1m 3s (42.76% Gen, 55.79% Train). Generation: 27s, Training: 35s. Estimated remaining time: 45h 21m 36s. Estimated total time: 53h 12m 42s. Time estimates for 10 more iterations: 10m 38s, 100 more iterations: 1h 46m 25s, 500 more iterations: 8h 52m 7s. [2025-11-13 06:00:47,780][__main__][INFO] - Starting iteration 507. [2025-11-13 06:00:48,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 06:00:48,294][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:01:17,508][__main__][INFO] - Number of regex retries in iteration 507: 0 [2025-11-13 06:01:17,508][__main__][INFO] - agents played in iteration 507 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:01:18,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:01:18,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:01:18,414][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:01:18,437][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:01:18,437][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:01:18,438][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:01:19,301][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:01:19,766][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:01:20,280][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:01:20,795][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:01:21,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:01:21,818][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:01:22,326][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:01:22,840][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:01:23,350][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:01:23,859][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:01:24,364][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:01:24,866][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:01:25,373][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:01:25,879][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:01:26,386][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:01:26,898][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:01:27,404][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:01:27,910][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:01:28,417][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:01:28,919][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:01:29,433][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:01:29,936][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:01:30,438][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:01:30,944][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:01:31,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:01:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:01:32,454][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:01:32,958][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:01:33,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:01:33,966][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:01:34,473][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:01:34,978][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:01:35,485][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:01:35,991][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:01:36,496][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:01:37,000][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:01:37,506][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:01:38,007][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:01:38,507][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:01:39,008][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:01:39,509][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:01:40,025][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:01:40,525][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:01:41,024][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:01:41,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:01:42,024][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:01:42,534][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:01:43,033][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:01:43,536][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:01:44,041][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:01:44,543][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:01:45,048][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:01:45,550][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:01:46,053][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:01:46,556][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:01:47,059][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:01:47,562][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:01:48,063][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:01:48,563][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:01:49,063][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:01:49,562][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:01:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:01:50,566][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:01:51,071][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:01:51,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10145 tokens. [2025-11-13 06:01:52,325][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33 [2025-11-13 06:01:53,055][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:01:53,057][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:01:53,058][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:01:54,031][__main__][INFO] - Iteration 508 took 1m 5s (44.44% Gen, 54.08% Train). Generation: 29s, Training: 35s. Estimated remaining time: 46h 54m 45s. Estimated total time: 54h 46m 57s. Time estimates for 10 more iterations: 10m 57s, 100 more iterations: 1h 49m 33s, 500 more iterations: 9h 7m 49s. [2025-11-13 06:01:54,033][__main__][INFO] - Starting iteration 508. [2025-11-13 06:01:54,522][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 06:01:54,522][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:02:20,541][__main__][INFO] - Number of regex retries in iteration 508: 0 [2025-11-13 06:02:20,543][__main__][INFO] - agents played in iteration 508 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:02:21,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:02:21,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:02:21,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:02:21,473][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:02:21,474][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:02:21,475][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:02:22,339][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:02:22,803][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:02:23,328][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:02:23,840][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:02:24,355][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:02:24,863][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:02:25,369][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:02:25,878][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:02:26,384][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:02:26,895][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:02:27,400][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:02:27,907][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:02:28,414][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:02:28,924][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:02:29,429][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:02:29,945][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:02:30,454][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:02:30,971][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:02:31,476][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:02:31,981][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:02:32,487][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:02:32,997][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:02:33,506][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:02:34,012][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:02:34,517][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:02:35,027][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:02:35,530][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:02:36,036][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:02:36,539][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:02:37,046][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:02:37,568][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:02:38,075][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:02:38,579][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:02:39,090][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:02:39,595][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:02:40,101][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:02:40,608][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:02:41,115][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:02:41,625][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:02:42,131][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:02:42,644][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:02:43,150][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:02:43,653][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:02:44,159][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:02:44,663][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:02:45,168][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:02:45,676][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:02:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:02:46,695][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:02:47,204][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:02:47,713][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:02:48,217][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:02:48,722][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:02:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:02:49,731][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:02:50,232][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:02:50,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:02:51,237][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:02:51,737][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:02:52,251][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:02:52,752][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:02:53,270][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:02:53,772][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:02:54,273][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:02:54,774][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10181 tokens. [2025-11-13 06:02:55,545][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 06:02:56,186][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:02:56,188][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:02:56,190][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:02:57,098][__main__][INFO] - Iteration 509 took 1m 2s (41.58% Gen, 56.97% Train). Generation: 26s, Training: 35s. Estimated remaining time: 44h 15m 34s. Estimated total time: 52h 8m 50s. Time estimates for 10 more iterations: 10m 25s, 100 more iterations: 1h 44m 17s, 500 more iterations: 8h 41m 28s. [2025-11-13 06:02:57,100][__main__][INFO] - Starting iteration 509. [2025-11-13 06:02:57,585][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 06:02:57,586][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:03:30,309][__main__][INFO] - Number of regex retries in iteration 509: 0 [2025-11-13 06:03:30,310][__main__][INFO] - agents played in iteration 509 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:03:31,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:03:31,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:03:31,210][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:03:31,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:03:31,234][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:03:31,235][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:03:32,079][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:03:32,552][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:03:33,061][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:03:33,570][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:03:34,093][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:03:34,610][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:03:35,119][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:03:35,628][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:03:36,138][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:03:36,653][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:03:37,158][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:03:37,665][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:03:38,170][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:03:38,672][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:03:39,186][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:03:39,693][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:03:40,196][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:03:40,708][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:03:41,210][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:03:41,714][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:03:42,223][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:03:42,731][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:03:43,233][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:03:43,740][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:03:44,242][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:03:44,759][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:03:45,263][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:03:45,766][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:03:46,276][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:03:46,778][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:03:47,294][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:03:47,802][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:03:48,307][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:03:48,815][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:03:49,319][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:03:49,826][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:03:50,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:03:50,838][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:03:51,352][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:03:51,853][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:03:52,355][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:03:52,863][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:03:53,366][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:03:53,884][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:03:54,383][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:03:54,881][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:03:55,385][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:03:55,885][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:03:56,390][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:03:56,894][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:03:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:03:57,903][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:03:58,408][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:03:58,909][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:03:59,413][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:03:59,916][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:04:00,419][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:04:00,919][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:04:01,426][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:04:01,930][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:04:02,431][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:04:02,932][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:04:03,444][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:04:03,950][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:04:04,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10242 tokens. [2025-11-13 06:04:05,224][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 06:04:05,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:04:05,960][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:04:05,961][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:04:06,955][__main__][INFO] - Iteration 510 took 1m 9s (47.17% Gen, 51.39% Train). Generation: 32s, Training: 35s. Estimated remaining time: 49h 54m 7s. Estimated total time: 57h 48m 33s. Time estimates for 10 more iterations: 11m 33s, 100 more iterations: 1h 55m 37s, 500 more iterations: 9h 38m 5s. [2025-11-13 06:04:06,957][__main__][INFO] - Starting iteration 510. [2025-11-13 06:04:07,429][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 50 and human policies 1. 
[2025-11-13 06:04:07,430][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:04:40,603][__main__][INFO] - Number of regex retries in iteration 510: 0 [2025-11-13 06:04:40,605][__main__][INFO] - agents played in iteration 510 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:04:41,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:04:41,508][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:04:41,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:04:41,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:04:41,563][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:04:41,564][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:04:42,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:04:42,890][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:04:43,409][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:04:43,911][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:04:44,418][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:04:44,923][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:04:45,428][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:04:45,935][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:04:46,445][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:04:46,952][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:04:47,458][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:04:47,966][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:04:48,474][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:04:48,980][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:04:49,496][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:04:50,001][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:04:50,511][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:04:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:04:51,524][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:04:52,031][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:04:52,539][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:04:53,043][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:04:53,555][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:04:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:04:54,565][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:04:55,067][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:04:55,573][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:04:56,091][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:04:56,595][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:04:57,100][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:04:57,603][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:04:58,103][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:04:58,608][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:04:59,109][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:04:59,609][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:05:00,112][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:05:00,618][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:05:01,121][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:05:01,621][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:05:02,124][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:05:02,628][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:05:03,130][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:05:03,631][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:05:04,135][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:05:04,635][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:05:05,139][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:05:05,639][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:05:06,138][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:05:06,639][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:05:07,143][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:05:07,645][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:05:08,144][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:05:08,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:05:09,145][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:05:09,647][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:05:10,147][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:05:10,648][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:05:11,148][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:05:11,652][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:05:12,156][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:05:12,657][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:05:13,157][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:05:13,658][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:05:14,163][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:05:14,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10188 tokens. [2025-11-13 06:05:15,442][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.04%, Current % of VRAM taken: 58.29%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:33 [2025-11-13 06:05:16,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:05:16,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:05:16,066][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:05:17,943][__main__][INFO] - Iteration 511 took 1m 10s (47.05% Gen, 50.29% Train). Generation: 33s, Training: 35s. Estimated remaining time: 50h 50m 6s. Estimated total time: 58h 45m 42s. Time estimates for 10 more iterations: 11m 45s, 100 more iterations: 1h 57m 31s, 500 more iterations: 9h 47m 37s. [2025-11-13 06:05:17,945][__main__][INFO] - Starting iteration 511. [2025-11-13 06:05:18,451][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 06:05:18,453][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:05:44,174][__main__][INFO] - Number of regex retries in iteration 511: 0 [2025-11-13 06:05:44,176][__main__][INFO] - agents played in iteration 511 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:05:45,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:05:45,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:05:45,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:05:45,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:05:45,191][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:05:45,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:05:46,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:05:46,601][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:05:47,131][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:05:47,636][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:05:48,138][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:05:48,640][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:05:49,147][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:05:49,648][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:05:50,150][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:05:50,653][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:05:51,162][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:05:51,672][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:05:52,174][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:05:52,677][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:05:53,184][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:05:53,713][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:05:54,218][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:05:54,725][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:05:55,224][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:05:55,727][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:05:56,231][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:05:56,734][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:05:57,238][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:05:57,744][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:05:58,246][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:05:58,754][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:05:59,255][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:05:59,755][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:06:00,258][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:06:00,761][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:06:01,264][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:06:01,766][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:06:02,268][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:06:02,772][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:06:03,282][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:06:03,783][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:06:04,294][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:06:04,799][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:06:05,307][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:06:05,812][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:06:06,313][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:06:06,824][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:06:07,325][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:06:07,827][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:06:08,328][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:06:08,832][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:06:09,335][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:06:09,836][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:06:10,337][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:06:10,838][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:06:11,339][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:06:11,840][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:06:12,341][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:06:12,840][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:06:13,343][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:06:13,844][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:06:14,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:06:14,845][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:06:15,345][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:06:15,850][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:06:16,352][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:06:16,853][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:06:17,360][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:06:17,859][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:06:18,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10081 tokens. [2025-11-13 06:06:19,152][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:33 [2025-11-13 06:06:19,813][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:06:19,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:06:19,816][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:06:20,749][__main__][INFO] - Iteration 512 took 1m 2s (41.29% Gen, 57.21% Train). Generation: 25s, Training: 35s. Estimated remaining time: 43h 58m 18s. Estimated total time: 51h 54m 57s. Time estimates for 10 more iterations: 10m 22s, 100 more iterations: 1h 43m 49s, 500 more iterations: 8h 39m 9s. [2025-11-13 06:06:20,752][__main__][INFO] - Starting iteration 512. [2025-11-13 06:06:21,228][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 06:06:21,228][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:06:49,791][__main__][INFO] - Number of regex retries in iteration 512: 0 [2025-11-13 06:06:49,791][__main__][INFO] - agents played in iteration 512 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:06:50,661][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:06:50,684][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:06:50,709][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:06:50,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:06:50,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:06:50,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:06:51,573][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:06:52,030][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:06:52,541][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:06:53,054][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:06:53,557][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:06:54,061][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:06:54,564][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:06:55,068][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:06:55,574][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:06:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:06:56,586][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:06:57,098][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:06:57,602][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:06:58,107][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:06:58,613][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:06:59,130][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:06:59,635][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:07:00,139][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:07:00,645][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:07:01,155][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:07:01,660][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:07:02,166][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:07:02,671][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:07:03,179][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:07:03,681][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:07:04,186][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:07:04,689][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:07:05,195][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:07:05,713][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:07:06,214][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:07:06,716][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:07:07,219][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:07:07,720][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:07:08,231][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:07:08,735][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:07:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:07:09,739][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:07:10,241][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:07:10,742][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:07:11,244][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:07:11,753][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:07:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:07:12,760][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:07:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:07:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:07:14,289][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:07:14,795][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:07:15,298][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:07:15,806][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:07:16,316][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:07:16,817][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:07:17,333][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:07:17,837][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:07:18,341][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:07:18,850][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:07:19,356][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:07:19,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:07:20,359][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:07:20,864][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:07:21,366][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:07:21,869][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:07:22,375][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:07:22,879][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:07:23,384][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:07:23,886][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10139 tokens. [2025-11-13 06:07:24,657][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:33 [2025-11-13 06:07:25,445][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:07:25,447][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:07:25,448][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:07:26,469][__main__][INFO] - Iteration 513 took 1m 5s (43.78% Gen, 54.65% Train). Generation: 28s, Training: 35s. Estimated remaining time: 46h 24m 19s. Estimated total time: 54h 22m 3s. Time estimates for 10 more iterations: 10m 52s, 100 more iterations: 1h 48m 44s, 500 more iterations: 9h 3m 40s. [2025-11-13 06:07:26,471][__main__][INFO] - Starting iteration 513. [2025-11-13 06:07:26,963][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 06:07:26,963][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:07:54,049][__main__][INFO] - Number of regex retries in iteration 513: 0 [2025-11-13 06:07:54,051][__main__][INFO] - agents played in iteration 513 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:07:55,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:07:55,077][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:07:55,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:07:55,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:07:55,151][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:07:55,152][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:07:56,088][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:07:56,639][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:07:57,165][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:07:57,675][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:07:58,179][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:07:58,684][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:07:59,189][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:07:59,695][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:08:00,198][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:08:00,703][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:08:01,209][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:08:01,714][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:08:02,224][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:08:02,733][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:08:03,242][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:08:03,747][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:08:04,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:08:04,758][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:08:05,262][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:08:05,766][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:08:06,273][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:08:06,775][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:08:07,278][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:08:07,780][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:08:08,281][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:08:08,782][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:08:09,285][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:08:09,789][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:08:10,291][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:08:10,797][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:08:11,301][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:08:11,805][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:08:12,306][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:08:12,811][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:08:13,314][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:08:13,818][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:08:14,327][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:08:14,834][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:08:15,337][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:08:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:08:16,340][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:08:16,862][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:08:17,363][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:08:17,866][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:08:18,370][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:08:18,872][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:08:19,378][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:08:19,879][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:08:20,381][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:08:20,882][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:08:21,383][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:08:21,884][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:08:22,383][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:08:22,883][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:08:23,388][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:08:23,888][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:08:24,389][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:08:24,895][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:08:25,396][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:08:25,900][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:08:26,400][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:08:26,900][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:08:27,403][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:08:27,907][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:08:28,418][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10086 tokens. [2025-11-13 06:08:29,262][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:33 [2025-11-13 06:08:29,911][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:08:29,914][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:08:29,916][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:08:30,822][__main__][INFO] - Iteration 514 took 1m 3s (42.42% Gen, 56.16% Train). Generation: 27s, Training: 35s. Estimated remaining time: 45h 14m 9s. Estimated total time: 53h 12m 58s. Time estimates for 10 more iterations: 10m 38s, 100 more iterations: 1h 46m 25s, 500 more iterations: 8h 52m 9s. [2025-11-13 06:08:30,824][__main__][INFO] - Starting iteration 514. [2025-11-13 06:08:31,299][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 06:08:31,300][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:09:02,815][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:09:03,793][__main__][INFO] - Number of regex retries in iteration 514: 1 [2025-11-13 06:09:03,794][__main__][INFO] - agents played in iteration 514 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:09:04,664][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:09:04,690][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:09:04,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:09:04,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:09:04,739][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:09:04,739][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:09:05,563][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:09:06,022][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:09:06,536][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:09:07,058][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:09:07,571][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:09:08,078][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:09:08,581][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:09:09,084][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:09:09,594][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:09:10,097][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:09:10,605][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:09:11,112][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:09:11,616][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:09:12,126][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:09:12,632][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:09:13,135][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:09:13,642][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:09:14,148][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:09:14,661][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:09:15,165][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:09:15,666][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:09:16,172][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:09:16,678][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:09:17,181][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:09:17,683][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:09:18,187][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:09:18,694][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:09:19,198][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:09:19,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:09:20,210][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:09:20,709][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:09:21,208][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:09:21,709][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:09:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:09:22,719][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:09:23,220][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:09:23,719][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:09:24,222][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:09:24,722][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:09:25,238][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:09:25,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:09:26,245][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:09:26,747][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:09:27,249][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:09:27,760][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:09:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:09:28,761][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:09:29,262][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:09:29,766][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:09:30,269][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:09:30,768][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:09:31,264][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:09:31,770][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:09:32,269][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:09:32,772][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:09:33,276][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:09:33,779][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:09:34,280][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:09:34,788][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:09:35,291][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:09:35,801][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:09:36,306][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:09:36,809][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:09:37,314][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:09:37,822][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10217 tokens. [2025-11-13 06:09:38,715][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 06:09:39,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:09:39,458][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:09:39,459][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:09:40,531][__main__][INFO] - Iteration 515 took 1m 9s (46.93% Gen, 51.51% Train). Generation: 32s, Training: 35s. Estimated remaining time: 49h 41m 39s. Estimated total time: 57h 41m 38s. Time estimates for 10 more iterations: 11m 32s, 100 more iterations: 1h 55m 23s, 500 more iterations: 9h 36m 56s. [2025-11-13 06:09:40,534][__main__][INFO] - Starting iteration 515. [2025-11-13 06:09:41,021][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 06:09:41,022][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:10:11,312][__main__][INFO] - Number of regex retries in iteration 515: 0 [2025-11-13 06:10:11,314][__main__][INFO] - agents played in iteration 515 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:10:12,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:10:12,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:10:12,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:10:12,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:10:12,248][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:10:12,248][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:10:13,049][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:10:13,506][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:10:14,015][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:10:14,522][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:10:15,031][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:10:15,533][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:10:16,035][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:10:16,555][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:10:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:10:17,558][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:10:18,060][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:10:18,565][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:10:19,072][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:10:19,575][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:10:20,078][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:10:20,584][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:10:21,090][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:10:21,595][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:10:22,096][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:10:22,599][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:10:23,101][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:10:23,606][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:10:24,110][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:10:24,611][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:10:25,114][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:10:25,617][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:10:26,123][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:10:26,626][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:10:27,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:10:27,644][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:10:28,153][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:10:28,654][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:10:29,155][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:10:29,675][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:10:30,178][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:10:30,680][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:10:31,184][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:10:31,685][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:10:32,190][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:10:32,692][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:10:33,195][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:10:33,697][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:10:34,201][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:10:34,703][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:10:35,208][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:10:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:10:36,220][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:10:36,727][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:10:37,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:10:37,761][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:10:38,273][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:10:38,783][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:10:39,287][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:10:39,792][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:10:40,303][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:10:40,806][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:10:41,312][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:10:41,817][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:10:42,325][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:10:42,835][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:10:43,340][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:10:43,848][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:10:44,372][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:10:44,878][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:10:45,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10108 tokens. [2025-11-13 06:10:46,289][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 06:10:46,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:10:46,949][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:10:46,951][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:10:47,881][__main__][INFO] - Iteration 516 took 1m 6s (45.31% Gen, 53.30% Train). Generation: 30s, Training: 35s. Estimated remaining time: 47h 41m 55s. Estimated total time: 55h 43m 2s. Time estimates for 10 more iterations: 11m 8s, 100 more iterations: 1h 51m 26s, 500 more iterations: 9h 17m 10s. [2025-11-13 06:10:47,884][__main__][INFO] - Starting iteration 516. [2025-11-13 06:10:48,369][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 06:10:48,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:11:18,050][__main__][INFO] - Number of regex retries in iteration 516: 0 [2025-11-13 06:11:18,051][__main__][INFO] - agents played in iteration 516 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:11:18,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:11:18,923][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:11:18,947][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:11:18,970][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:11:18,970][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:11:18,972][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:11:19,748][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:11:20,206][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:11:20,715][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:11:21,228][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:11:21,741][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:11:22,243][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:11:22,760][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:11:23,263][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:11:23,777][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:11:24,283][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:11:24,789][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:11:25,302][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:11:25,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:11:26,317][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:11:26,827][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:11:27,336][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:11:27,846][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:11:28,353][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:11:28,860][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:11:29,370][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:11:29,877][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:11:30,398][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:11:30,905][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:11:31,411][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:11:31,920][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:11:32,425][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:11:32,937][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:11:33,443][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:11:33,947][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:11:34,455][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:11:34,965][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:11:35,466][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:11:35,980][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:11:36,484][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:11:37,005][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:11:37,514][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:11:38,017][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:11:38,524][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:11:39,029][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:11:39,534][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:11:40,041][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:11:40,544][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:11:41,046][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:11:41,556][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:11:42,066][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:11:42,574][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:11:43,084][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:11:43,607][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:11:44,112][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:11:44,619][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:11:45,126][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:11:45,635][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:11:46,142][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:11:46,651][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:11:47,156][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:11:47,681][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:11:48,187][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:11:48,701][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:11:49,207][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:11:49,717][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:11:50,248][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:11:50,760][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:11:51,269][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:11:51,780][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:11:52,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10219 tokens. [2025-11-13 06:11:53,192][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.52%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 06:11:53,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:11:53,934][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:11:53,936][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:11:54,978][__main__][INFO] - Iteration 517 took 1m 6s (44.56% Gen, 53.88% Train). Generation: 29s, Training: 35s. Estimated remaining time: 47h 28m 15s. Estimated total time: 55h 30m 28s. Time estimates for 10 more iterations: 11m 6s, 100 more iterations: 1h 51m 0s, 500 more iterations: 9h 15m 4s. [2025-11-13 06:11:54,981][__main__][INFO] - Starting iteration 517. [2025-11-13 06:11:55,463][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 06:11:55,463][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:12:23,980][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:12:27,594][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:12:31,683][__main__][INFO] - Number of regex retries in iteration 517: 2 [2025-11-13 06:12:31,683][__main__][INFO] - agents played in iteration 517 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:12:32,525][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:12:32,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:12:32,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:12:32,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:12:32,603][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:12:32,603][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:12:33,364][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:12:33,821][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:12:34,333][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:12:34,836][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:12:35,346][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:12:35,852][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:12:36,355][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:12:36,869][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:12:37,374][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:12:37,886][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:12:38,389][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:12:38,893][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:12:39,409][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:12:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:12:40,422][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:12:40,925][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:12:41,430][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:12:41,934][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:12:42,442][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:12:42,946][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:12:43,454][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:12:43,964][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:12:44,475][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:12:44,980][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:12:45,483][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:12:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:12:46,508][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:12:47,013][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:12:47,515][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:12:48,022][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:12:48,533][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:12:49,040][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:12:49,544][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:12:50,055][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:12:50,562][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:12:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:12:51,576][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:12:52,083][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:12:52,604][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:12:53,108][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:12:53,618][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:12:54,126][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:12:54,634][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:12:55,144][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:12:55,658][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:12:56,183][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:12:56,693][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:12:57,204][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:12:57,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:12:58,224][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:12:58,733][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:12:59,238][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:12:59,742][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:13:00,251][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:13:00,756][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:13:01,261][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:13:01,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:13:02,287][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:13:02,795][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:13:03,300][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:13:03,810][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:13:04,322][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:13:04,831][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:13:05,336][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:13:05,842][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10166 tokens. [2025-11-13 06:13:06,715][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 06:13:07,353][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:13:07,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:13:07,357][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:13:08,236][__main__][INFO] - Iteration 518 took 1m 12s (49.77% Gen, 49.02% Train). Generation: 36s, Training: 35s. Estimated remaining time: 52h 35m 14s. Estimated total time: 60h 38m 40s. Time estimates for 10 more iterations: 12m 7s, 100 more iterations: 2h 1m 17s, 500 more iterations: 10h 6m 26s. [2025-11-13 06:13:08,239][__main__][INFO] - Starting iteration 518. [2025-11-13 06:13:08,721][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 06:13:08,721][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:13:30,279][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls This might seem counterintuitive, but given that both you and Alice value balls the highest at 10, it's likely that splitting the balls will lead to a proportional split which might not fully utilize your and Alice's high values for balls. By not proposing any balls, you and Alice might redirect the focus to other items, potentially leading to a more efficient use of your high values for hats and books. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:13:32,137][__main__][INFO] - Number of regex retries in iteration 518: 1 [2025-11-13 06:13:32,137][__main__][INFO] - agents played in iteration 518 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:13:32,933][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:13:32,956][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:13:32,978][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:13:33,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:13:33,002][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. 
[2025-11-13 06:13:33,003][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 06:13:33,775][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:13:34,236][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:13:34,746][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:13:35,250][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:13:35,754][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:13:36,263][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:13:36,769][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:13:37,281][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:13:37,789][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:13:38,292][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:13:38,797][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:13:39,300][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:13:39,803][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:13:40,313][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:13:40,817][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:13:41,327][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:13:41,829][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:13:42,332][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:13:42,837][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:13:43,337][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 
06:13:43,846][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 06:13:44,349][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:13:44,853][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:13:45,358][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:13:45,858][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:13:46,357][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:13:46,858][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:13:47,358][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:13:47,859][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:13:48,359][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:13:48,862][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:13:49,363][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:13:49,866][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:13:50,366][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:13:50,866][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:13:51,366][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:13:51,866][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:13:52,365][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:13:52,864][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:13:53,361][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:13:53,861][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 
06:13:54,362][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 06:13:54,869][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:13:55,372][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:13:55,876][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:13:56,379][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:13:56,886][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:13:57,388][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:13:57,896][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:13:58,416][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:13:58,924][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:13:59,432][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:13:59,943][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:14:00,452][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:14:00,962][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:14:01,469][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:14:01,979][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:14:02,496][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:14:03,003][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:14:03,520][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:14:04,029][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:14:04,538][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 
06:14:05,047][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 06:14:05,551][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:14:06,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10118 tokens. [2025-11-13 06:14:08,283][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:34 [2025-11-13 06:14:09,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:14:09,995][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:14:09,997][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:14:10,912][__main__][INFO] - Iteration 519 took 1m 2s (37.65% Gen, 60.87% Train). Generation: 23s, Training: 37s. Estimated remaining time: 43h 45m 7s. Estimated total time: 51h 49m 36s. Time estimates for 10 more iterations: 10m 21s, 100 more iterations: 1h 43m 39s, 500 more iterations: 8h 38m 16s. [2025-11-13 06:14:10,916][__main__][INFO] - Starting iteration 519. [2025-11-13 06:14:11,401][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 06:14:11,402][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:14:45,966][__main__][INFO] - Number of regex retries in iteration 519: 0 [2025-11-13 06:14:45,966][__main__][INFO] - agents played in iteration 519 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:14:46,798][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:14:46,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:14:46,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:14:46,870][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:14:46,870][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:14:46,871][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:14:47,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:14:48,100][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:14:48,610][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:14:49,113][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:14:49,619][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:14:50,123][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:14:50,629][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:14:51,132][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:14:51,635][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:14:52,140][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:14:52,654][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:14:53,162][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:14:53,670][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:14:54,179][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:14:54,682][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:14:55,189][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:14:55,707][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:14:56,214][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:14:56,737][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:14:57,248][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:14:57,752][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:14:58,257][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:14:58,769][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:14:59,279][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:14:59,782][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:15:00,287][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:15:00,797][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:15:01,306][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:15:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:15:02,321][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:15:02,827][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:15:03,345][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:15:03,857][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:15:04,364][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:15:04,874][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:15:05,381][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:15:05,892][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:15:06,400][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:15:06,911][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:15:07,439][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:15:07,944][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:15:08,454][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:15:08,960][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:15:09,463][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:15:09,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:15:10,473][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:15:10,978][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:15:11,488][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:15:11,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:15:12,501][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:15:13,009][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:15:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:15:14,044][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:15:14,552][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:15:15,060][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:15:15,569][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:15:16,078][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:15:16,592][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:15:17,102][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:15:17,610][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:15:18,128][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:15:18,637][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:15:19,149][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:15:19,658][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:15:20,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10230 tokens. [2025-11-13 06:15:21,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 06:15:21,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:15:21,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:15:21,869][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:15:22,874][__main__][INFO] - Iteration 520 took 1m 11s (48.36% Gen, 50.23% Train). Generation: 34s, Training: 35s. Estimated remaining time: 51h 27m 59s. Estimated total time: 59h 33m 40s. Time estimates for 10 more iterations: 11m 54s, 100 more iterations: 1h 59m 7s, 500 more iterations: 9h 55m 36s. [2025-11-13 06:15:22,876][__main__][INFO] - Starting iteration 520. [2025-11-13 06:15:23,364][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 51 and human policies 1. 
[2025-11-13 06:15:23,365][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:15:37,933][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:15:49,524][__main__][INFO] - Number of regex retries in iteration 520: 1 [2025-11-13 06:15:49,525][__main__][INFO] - agents played in iteration 520 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:15:50,307][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:15:50,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:15:50,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:15:50,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:15:50,385][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:15:50,386][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:15:51,168][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:15:51,625][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:15:52,154][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:15:52,659][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:15:53,165][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:15:53,676][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:15:54,180][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:15:54,689][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:15:55,196][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:15:55,700][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:15:56,209][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:15:56,715][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:15:57,225][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:15:57,728][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:15:58,235][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:15:58,740][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:15:59,244][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:15:59,755][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:16:00,259][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:16:00,769][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:16:01,279][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:16:01,782][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:16:02,286][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:16:02,789][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:16:03,289][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:16:03,796][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:16:04,303][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:16:04,808][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:16:05,316][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:16:05,820][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:16:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:16:06,838][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:16:07,342][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:16:07,859][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:16:08,366][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:16:08,871][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:16:09,376][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:16:09,887][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:16:10,392][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:16:10,904][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:16:11,411][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:16:11,926][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:16:12,431][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:16:13,553][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:16:15,138][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:16:15,653][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:16:16,159][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:16:16,663][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:16:17,167][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:16:17,675][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:16:18,181][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:16:18,687][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:16:19,196][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:16:19,704][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:16:20,215][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:16:20,721][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:16:21,228][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:16:21,737][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:16:22,258][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:16:22,761][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:16:23,269][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:16:23,773][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:16:24,283][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:16:24,790][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:16:25,297][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10135 tokens. [2025-11-13 06:16:26,214][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:35 [2025-11-13 06:16:26,860][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:16:26,862][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:16:26,863][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:16:28,721][__main__][INFO] - Iteration 521 took 1m 5s (40.03% Gen, 57.13% Train). Generation: 26s, Training: 37s. Estimated remaining time: 46h 21m 4s. Estimated total time: 54h 27m 51s. Time estimates for 10 more iterations: 10m 53s, 100 more iterations: 1h 48m 55s, 500 more iterations: 9h 4m 38s. [2025-11-13 06:16:28,725][__main__][INFO] - Starting iteration 521. [2025-11-13 06:16:29,207][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 06:16:29,208][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:16:52,547][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:16:57,890][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:17:01,478][__main__][INFO] - Number of regex retries in iteration 521: 2 [2025-11-13 06:17:01,478][__main__][INFO] - agents played in iteration 521 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:17:02,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:17:02,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:17:02,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:17:02,400][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:17:02,401][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:17:02,402][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:17:03,153][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:17:03,608][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:17:04,116][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:17:04,624][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:17:05,128][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:17:05,648][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:17:06,149][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:17:06,650][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:17:07,155][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:17:07,657][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:17:08,159][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:17:08,659][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:17:09,159][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:17:09,661][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:17:10,162][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:17:10,677][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:17:11,177][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:17:11,681][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:17:12,190][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:17:12,696][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:17:13,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:17:13,707][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:17:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:17:14,720][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:17:15,228][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:17:15,735][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:17:16,244][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:17:16,773][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:17:17,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:17:17,789][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:17:18,297][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:17:18,807][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:17:19,313][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:17:19,821][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:17:20,325][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:17:20,833][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:17:21,340][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:17:21,845][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:17:22,363][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:17:22,870][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:17:23,375][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:17:23,882][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:17:24,389][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:17:24,900][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:17:25,405][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:17:25,916][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:17:26,425][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:17:26,932][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:17:27,438][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:17:27,947][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:17:28,456][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:17:28,976][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:17:29,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:17:29,992][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:17:30,499][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:17:31,007][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:17:31,518][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:17:32,024][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:17:32,531][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:17:33,040][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:17:33,545][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:17:34,052][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:17:34,556][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:17:35,062][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:17:35,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10142 tokens. [2025-11-13 06:17:36,440][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 06:17:37,196][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:17:37,200][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:17:37,201][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:17:38,205][__main__][INFO] - Iteration 522 took 1m 8s (46.77% Gen, 51.77% Train). Generation: 32s, Training: 35s. Estimated remaining time: 49h 21m 59s. Estimated total time: 57h 29m 56s. Time estimates for 10 more iterations: 11m 29s, 100 more iterations: 1h 54m 59s, 500 more iterations: 9h 34m 59s. [2025-11-13 06:17:38,207][__main__][INFO] - Starting iteration 522. [2025-11-13 06:17:38,711][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 06:17:38,713][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:18:01,642][__main__][INFO] - Number of regex retries in iteration 522: 0 [2025-11-13 06:18:01,648][__main__][INFO] - agents played in iteration 522 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:18:02,672][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:18:02,695][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:18:02,718][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:18:02,740][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:18:02,740][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:18:02,741][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:18:03,621][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:18:04,083][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:18:04,598][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:18:05,108][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:18:05,611][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:18:06,119][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:18:06,628][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:18:07,134][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:18:07,639][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:18:08,143][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:18:08,657][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:18:09,163][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:18:09,669][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:18:10,177][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:18:10,686][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:18:11,199][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:18:11,706][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:18:12,208][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:18:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:18:13,220][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:18:13,724][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:18:14,228][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:18:14,730][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:18:15,236][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:18:15,738][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:18:16,241][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:18:16,744][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:18:17,249][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:18:17,752][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:18:18,267][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:18:18,773][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:18:19,295][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:18:19,799][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:18:20,330][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:18:20,838][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:18:21,351][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:18:21,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:18:22,365][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:18:22,878][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:18:23,402][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:18:23,912][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:18:24,417][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:18:24,925][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:18:25,432][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:18:25,944][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:18:26,450][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:18:26,957][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:18:27,474][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:18:27,981][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:18:28,497][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:18:29,005][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:18:29,512][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:18:30,022][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:18:30,534][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:18:31,041][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:18:31,548][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:18:32,055][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:18:32,568][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:18:33,075][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:18:33,581][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:18:34,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:18:34,596][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:18:35,106][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:18:35,614][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:18:36,121][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10149 tokens. [2025-11-13 06:18:37,028][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 06:18:37,677][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:18:37,680][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:18:37,682][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:18:38,579][__main__][INFO] - Iteration 523 took 59s (38.31% Gen, 60.19% Train). Generation: 22s, Training: 36s. Estimated remaining time: 41h 44m 30s. Estimated total time: 49h 53m 27s. Time estimates for 10 more iterations: 9m 58s, 100 more iterations: 1h 39m 46s, 500 more iterations: 8h 18m 54s. [2025-11-13 06:18:38,582][__main__][INFO] - Starting iteration 523. [2025-11-13 06:18:39,063][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 06:18:39,064][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:19:13,353][__main__][INFO] - Number of regex retries in iteration 523: 0 [2025-11-13 06:19:13,354][__main__][INFO] - agents played in iteration 523 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:19:14,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:19:14,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:19:14,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:19:14,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:19:14,244][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:19:14,245][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:19:15,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:19:15,550][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:19:16,058][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:19:16,563][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:19:17,068][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:19:17,574][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:19:18,079][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:19:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:19:19,104][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:19:19,612][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:19:20,114][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:19:20,620][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:19:21,127][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:19:21,628][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:19:22,133][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:19:22,633][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:19:23,136][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:19:23,644][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:19:24,148][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:19:24,653][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:19:25,158][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:19:25,659][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:19:26,167][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:19:26,668][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:19:27,174][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:19:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:19:28,185][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:19:28,690][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:19:29,197][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:19:29,702][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:19:30,225][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:19:30,734][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:19:31,240][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:19:31,754][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:19:32,260][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:19:32,771][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:19:33,283][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:19:33,796][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:19:34,306][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:19:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:19:35,318][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:19:35,823][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:19:36,327][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:19:36,835][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:19:37,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:19:37,851][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:19:38,358][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:19:38,864][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:19:39,374][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:19:39,884][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:19:40,388][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:19:40,898][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:19:41,404][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:19:41,919][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:19:42,425][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:19:42,934][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:19:43,443][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:19:43,951][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:19:44,457][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:19:44,966][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:19:45,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:19:45,987][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:19:46,491][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:19:46,996][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:19:47,503][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10060 tokens. [2025-11-13 06:19:48,361][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 06:19:49,100][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:19:49,102][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:19:49,104][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:19:50,103][__main__][INFO] - Iteration 524 took 1m 11s (48.27% Gen, 50.32% Train). Generation: 34s, Training: 35s. Estimated remaining time: 51h 1m 54s. Estimated total time: 59h 12m 2s. Time estimates for 10 more iterations: 11m 50s, 100 more iterations: 1h 58m 24s, 500 more iterations: 9h 52m 0s. [2025-11-13 06:19:50,106][__main__][INFO] - Starting iteration 524. [2025-11-13 06:19:50,576][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 06:19:50,576][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:20:16,365][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:20:18,057][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:20:20,082][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:20:21,120][__main__][INFO] - Number of regex retries in iteration 524: 3 [2025-11-13 06:20:21,120][__main__][INFO] - agents played in iteration 524 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:20:21,983][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:20:22,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:20:22,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:20:22,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:20:22,052][mllm.training.trainer_ad_align][INFO] - 
Sharing advantage alignment data. [2025-11-13 06:20:22,053][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 06:20:22,873][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:20:23,338][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:20:23,856][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:20:24,365][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:20:24,872][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:20:25,381][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:20:25,897][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:20:26,405][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:20:26,921][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:20:27,431][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:20:27,936][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:20:28,443][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:20:28,944][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:20:29,446][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:20:29,955][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:20:30,456][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:20:30,959][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:20:31,461][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:20:31,970][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:20:32,481][mllm.training.trainer_common][INFO] - Processing 
mini-batch 19 of 64 [2025-11-13 06:20:32,984][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 06:20:33,488][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:20:34,006][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:20:34,511][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:20:35,034][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:20:35,544][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:20:36,049][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:20:36,558][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:20:37,063][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:20:37,573][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:20:38,080][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:20:38,586][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:20:39,108][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:20:39,615][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:20:40,131][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:20:40,636][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:20:41,143][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:20:41,657][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:20:42,164][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:20:42,669][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:20:43,172][mllm.training.trainer_common][INFO] - Processing 
mini-batch 40 of 64 [2025-11-13 06:20:43,676][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 06:20:44,184][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:20:44,693][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:20:45,203][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:20:45,714][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:20:46,222][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:20:46,730][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:20:47,235][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:20:47,742][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:20:48,252][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:20:48,758][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:20:49,262][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:20:49,767][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:20:50,272][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:20:50,784][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:20:51,288][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:20:51,799][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:20:52,319][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:20:52,825][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:20:53,348][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:20:53,855][mllm.training.trainer_common][INFO] - Processing 
mini-batch 61 of 64 [2025-11-13 06:20:54,363][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 06:20:54,873][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:20:55,380][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10172 tokens. [2025-11-13 06:20:56,271][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 06:20:56,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:20:56,919][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:20:56,921][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:20:57,823][__main__][INFO] - Iteration 525 took 1m 7s (45.42% Gen, 53.24% Train). Generation: 30s, Training: 35s. Estimated remaining time: 47h 51m 7s. Estimated total time: 56h 2m 23s. Time estimates for 10 more iterations: 11m 12s, 100 more iterations: 1h 52m 4s, 500 more iterations: 9h 20m 23s. [2025-11-13 06:20:57,825][__main__][INFO] - Starting iteration 525. [2025-11-13 06:20:58,294][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 06:20:58,295][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:21:23,540][__main__][INFO] - Number of regex retries in iteration 525: 0 [2025-11-13 06:21:23,541][__main__][INFO] - agents played in iteration 525 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:21:24,324][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:21:24,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:21:24,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:21:24,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:21:24,404][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:21:24,405][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:21:25,220][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:21:25,678][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:21:26,192][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:21:26,698][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:21:27,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:21:27,708][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:21:28,216][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:21:28,725][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:21:29,231][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:21:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:21:30,241][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:21:30,745][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:21:31,252][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:21:31,764][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:21:32,265][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:21:32,767][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:21:33,273][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:21:33,774][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:21:34,281][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:21:34,782][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:21:35,282][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:21:35,781][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:21:36,282][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:21:36,783][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:21:37,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:21:37,786][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:21:38,291][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:21:38,795][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:21:39,302][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:21:39,808][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:21:40,314][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:21:40,831][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:21:41,355][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:21:41,862][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:21:42,364][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:21:42,869][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:21:43,374][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:21:43,883][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:21:44,388][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:21:44,895][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:21:45,405][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:21:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:21:46,432][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:21:46,942][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:21:47,462][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:21:47,971][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:21:48,480][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:21:48,987][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:21:49,494][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:21:49,999][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:21:50,504][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:21:51,008][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:21:51,523][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:21:52,025][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:21:52,545][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:21:53,048][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:21:53,555][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:21:54,065][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:21:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:21:55,081][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:21:56,558][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:21:57,759][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:21:58,262][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:21:58,767][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:21:59,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10072 tokens. [2025-11-13 06:22:00,176][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:34 [2025-11-13 06:22:00,851][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:22:00,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:22:00,854][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:22:01,755][__main__][INFO] - Iteration 526 took 1m 3s (39.78% Gen, 58.80% Train). Generation: 25s, Training: 37s. Estimated remaining time: 44h 40m 42s. Estimated total time: 52h 53m 2s. Time estimates for 10 more iterations: 10m 34s, 100 more iterations: 1h 45m 46s, 500 more iterations: 8h 48m 50s. [2025-11-13 06:22:01,758][__main__][INFO] - Starting iteration 526. [2025-11-13 06:22:02,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 06:22:02,239][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:22:33,538][__main__][INFO] - Number of regex retries in iteration 526: 0 [2025-11-13 06:22:33,539][__main__][INFO] - agents played in iteration 526 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:22:34,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:22:34,423][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:22:34,448][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:22:34,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:22:34,471][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:22:34,472][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:22:35,303][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:22:35,761][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:22:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:22:36,787][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:22:37,299][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:22:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:22:38,310][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:22:38,816][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:22:39,324][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:22:39,831][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:22:40,354][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:22:40,856][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:22:41,359][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:22:41,863][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:22:42,370][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:22:42,879][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:22:43,382][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:22:43,889][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:22:44,395][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:22:44,899][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:22:45,406][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:22:45,907][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:22:46,409][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:22:46,915][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:22:47,418][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:22:47,922][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:22:48,427][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:22:48,929][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:22:49,433][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:22:49,938][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:22:50,444][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:22:50,958][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:22:51,468][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:22:51,983][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:22:52,488][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:22:52,992][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:22:53,504][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:22:54,014][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:22:54,519][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:22:55,024][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:22:55,528][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:22:56,037][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:22:56,546][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:22:57,052][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:22:57,563][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:22:58,066][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:22:58,581][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:22:59,091][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:22:59,601][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:23:00,112][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:23:00,618][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:23:01,123][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:23:01,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:23:02,136][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:23:02,658][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:23:03,163][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:23:03,670][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:23:04,178][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:23:04,687][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:23:05,198][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:23:05,707][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:23:06,212][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:23:06,719][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:23:07,230][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:23:07,744][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10149 tokens. [2025-11-13 06:23:08,628][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 06:23:09,413][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:23:09,415][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:23:09,416][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:23:10,469][__main__][INFO] - Iteration 527 took 1m 8s (45.87% Gen, 52.58% Train). Generation: 31s, Training: 35s. Estimated remaining time: 48h 38m 5s. Estimated total time: 56h 51m 34s. Time estimates for 10 more iterations: 11m 22s, 100 more iterations: 1h 53m 43s, 500 more iterations: 9h 28m 35s. [2025-11-13 06:23:10,471][__main__][INFO] - Starting iteration 527. [2025-11-13 06:23:10,945][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 06:23:10,946][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:23:35,420][__main__][INFO] - Number of regex retries in iteration 527: 0 [2025-11-13 06:23:35,421][__main__][INFO] - agents played in iteration 527 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:23:36,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:23:36,409][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:23:36,432][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:23:36,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:23:36,456][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:23:36,456][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:23:37,298][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:23:37,782][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:23:38,294][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:23:38,802][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:23:39,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:23:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:23:40,317][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:23:40,822][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:23:41,326][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:23:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:23:42,339][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:23:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:23:43,349][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:23:43,861][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:23:44,366][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:23:44,867][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:23:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:23:45,879][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:23:46,383][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:23:46,908][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:23:47,411][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:23:47,914][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:23:48,415][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:23:48,916][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:23:49,423][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:23:49,924][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:23:50,426][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:23:50,932][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:23:51,435][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:23:51,939][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:23:52,442][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:23:52,946][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:23:53,454][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:23:53,967][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:23:54,472][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:23:54,978][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:23:55,482][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:23:56,006][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:23:56,515][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:23:57,025][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:23:57,532][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:23:58,038][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:23:58,549][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:23:59,053][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:23:59,562][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:24:00,067][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:24:00,675][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:24:02,725][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:24:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:24:03,753][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:24:04,256][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:24:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:24:05,274][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:24:05,778][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:24:06,282][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:24:06,791][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:24:07,295][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:24:07,809][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:24:08,314][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:24:08,821][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:24:09,333][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:24:09,841][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:24:10,349][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:24:10,857][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:24:11,362][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10016 tokens. [2025-11-13 06:24:12,295][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.12%, ΔTime: 00:00:35 [2025-11-13 06:24:12,965][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:24:12,967][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:24:12,969][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:24:13,919][__main__][INFO] - Iteration 528 took 1m 2s (38.87% Gen, 59.63% Train). Generation: 24s, Training: 37s. Estimated remaining time: 44h 14m 10s. Estimated total time: 52h 28m 42s. Time estimates for 10 more iterations: 10m 29s, 100 more iterations: 1h 44m 57s, 500 more iterations: 8h 44m 47s. [2025-11-13 06:24:13,922][__main__][INFO] - Starting iteration 528. [2025-11-13 06:24:14,415][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 06:24:14,416][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:24:45,559][__main__][INFO] - Number of regex retries in iteration 528: 0 [2025-11-13 06:24:45,560][__main__][INFO] - agents played in iteration 528 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:24:46,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:24:46,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:24:46,440][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:24:46,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:24:46,462][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:24:46,464][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:24:47,298][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:24:47,760][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:24:48,279][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:24:48,808][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:24:49,317][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:24:49,823][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:24:50,328][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:24:50,837][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:24:51,347][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:24:51,851][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:24:52,359][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:24:52,874][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:24:53,379][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:24:53,893][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:24:54,398][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:24:54,905][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:24:55,408][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:24:55,910][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:24:56,412][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:24:56,913][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:24:57,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:24:57,926][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:24:58,426][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:24:58,928][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:24:59,433][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:24:59,937][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:25:00,439][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:25:00,941][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:25:01,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:25:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:25:02,447][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:25:02,950][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:25:03,456][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:25:03,960][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:25:04,465][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:25:04,972][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:25:05,482][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:25:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:25:06,510][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:25:07,023][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:25:07,529][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:25:08,034][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:25:08,545][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:25:09,050][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:25:09,562][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:25:10,065][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:25:10,571][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:25:11,082][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:25:11,586][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:25:12,098][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:25:12,614][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:25:13,125][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:25:13,638][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:25:14,146][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:25:14,654][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:25:15,165][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:25:15,677][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:25:16,192][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:25:16,700][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:25:17,210][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:25:17,724][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:25:18,232][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:25:18,740][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:25:19,253][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:25:19,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10118 tokens. [2025-11-13 06:25:20,688][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.33%, Current % of VRAM taken: 58.58%, Block Peak % of device VRAM: 62.48%, ΔTime: 00:00:33 [2025-11-13 06:25:21,461][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:25:21,463][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:25:21,465][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:25:22,471][__main__][INFO] - Iteration 529 took 1m 8s (45.76% Gen, 52.76% Train). Generation: 31s, Training: 35s. Estimated remaining time: 48h 27m 8s. Estimated total time: 56h 42m 48s. Time estimates for 10 more iterations: 11m 20s, 100 more iterations: 1h 53m 25s, 500 more iterations: 9h 27m 8s. [2025-11-13 06:25:22,473][__main__][INFO] - Starting iteration 529. [2025-11-13 06:25:22,953][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 06:25:22,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:25:47,175][__main__][INFO] - Number of regex retries in iteration 529: 0 [2025-11-13 06:25:47,175][__main__][INFO] - agents played in iteration 529 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:25:47,999][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.43%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:25:48,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.43%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:25:48,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.43%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:25:48,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.43%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:25:48,068][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:25:48,068][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:25:48,881][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:25:49,350][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:25:49,861][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:25:50,366][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:25:50,872][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:25:51,380][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:25:51,969][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:25:54,252][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:25:54,757][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:25:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:25:55,776][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:25:56,285][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:25:56,792][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:25:57,299][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:25:57,801][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:25:58,318][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:25:58,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:25:59,327][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:25:59,831][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:26:00,336][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:26:00,852][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:26:01,364][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:26:01,869][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:26:02,377][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:26:02,883][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:26:03,399][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:26:03,907][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:26:04,410][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:26:04,920][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:26:05,422][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:26:05,929][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:26:06,435][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:26:06,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:26:07,451][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:26:07,955][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:26:08,462][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:26:08,971][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:26:09,477][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:26:09,984][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:26:10,493][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:26:11,000][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:26:11,513][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:26:12,020][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:26:12,532][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:26:13,038][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:26:13,546][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:26:14,054][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:26:14,562][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:26:15,067][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:26:15,578][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:26:16,086][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:26:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:26:17,115][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:26:17,632][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:26:18,142][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:26:18,648][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:26:19,160][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:26:19,664][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:26:20,166][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:26:20,669][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:26:21,173][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:26:21,678][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:26:22,184][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:26:22,696][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:26:23,203][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10114 tokens. [2025-11-13 06:26:24,096][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:35 [2025-11-13 06:26:24,732][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:26:24,735][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:26:24,737][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:26:25,679][__main__][INFO] - Iteration 530 took 1m 2s (38.61% Gen, 59.88% Train). Generation: 24s, Training: 37s. Estimated remaining time: 43h 59m 37s. Estimated total time: 52h 16m 21s. Time estimates for 10 more iterations: 10m 27s, 100 more iterations: 1h 44m 32s, 500 more iterations: 8h 42m 43s. [2025-11-13 06:26:25,682][__main__][INFO] - Starting iteration 530. [2025-11-13 06:26:26,165][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 52 and human policies 1. 
[2025-11-13 06:26:26,167][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:26:58,483][__main__][INFO] - Number of regex retries in iteration 530: 0 [2025-11-13 06:26:58,484][__main__][INFO] - agents played in iteration 530 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:26:59,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:26:59,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:26:59,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:26:59,333][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:26:59,334][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:26:59,335][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:27:00,138][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:27:00,596][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:27:01,119][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:27:01,624][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:27:02,138][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:27:02,639][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:27:03,140][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:27:03,643][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:27:04,148][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:27:04,656][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:27:05,160][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:27:05,662][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:27:06,172][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:27:06,675][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:27:07,179][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:27:07,682][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:27:08,188][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:27:08,690][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:27:09,198][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:27:09,701][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:27:10,205][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:27:10,706][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:27:11,206][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:27:11,706][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:27:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:27:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:27:13,217][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:27:13,721][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:27:14,227][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:27:14,729][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:27:15,232][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:27:15,739][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:27:16,239][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:27:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:27:17,260][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:27:17,766][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:27:18,281][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:27:18,786][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:27:19,295][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:27:19,804][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:27:20,311][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:27:20,819][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:27:21,324][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:27:21,833][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:27:22,337][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:27:22,843][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:27:23,360][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:27:23,872][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:27:24,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:27:24,884][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:27:25,394][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:27:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:27:26,411][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:27:26,917][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:27:27,424][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:27:27,932][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:27:28,438][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:27:28,948][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:27:29,453][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:27:29,962][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:27:30,467][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:27:30,973][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:27:31,483][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:27:31,988][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:27:32,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10047 tokens. [2025-11-13 06:27:33,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.05%, ΔTime: 00:00:33 [2025-11-13 06:27:34,170][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:27:34,172][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:27:34,174][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:27:36,217][__main__][INFO] - Iteration 531 took 1m 10s (46.13% Gen, 50.95% Train). Generation: 32s, Training: 35s. Estimated remaining time: 50h 4m 42s. Estimated total time: 58h 22m 36s. Time estimates for 10 more iterations: 11m 40s, 100 more iterations: 1h 56m 45s, 500 more iterations: 9h 43m 46s. [2025-11-13 06:27:36,219][__main__][INFO] - Starting iteration 531. [2025-11-13 06:27:36,947][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 06:27:36,948][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:28:02,931][__main__][INFO] - Number of regex retries in iteration 531: 0 [2025-11-13 06:28:02,932][__main__][INFO] - agents played in iteration 531 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:28:03,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:28:03,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:28:03,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:28:03,874][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:28:03,875][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:28:03,875][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:28:04,692][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:28:05,157][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:28:05,668][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:28:06,174][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:28:06,690][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:28:07,197][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:28:07,704][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:28:08,211][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:28:08,721][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:28:09,240][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:28:09,748][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:28:10,258][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:28:10,760][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:28:11,265][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:28:11,772][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:28:12,278][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:28:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:28:13,288][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:28:13,790][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:28:14,291][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:28:14,794][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:28:15,298][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:28:15,802][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:28:16,305][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:28:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:28:17,322][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:28:17,826][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:28:18,341][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:28:18,845][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:28:19,353][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:28:19,857][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:28:20,359][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:28:20,860][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:28:21,365][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:28:21,871][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:28:22,377][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:28:22,886][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:28:23,394][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:28:23,902][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:28:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:28:24,927][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:28:25,427][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:28:25,935][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:28:26,457][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:28:26,961][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:28:27,465][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:28:27,970][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:28:28,472][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:28:28,977][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:28:29,483][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:28:29,986][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:28:30,493][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:28:31,000][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:28:31,509][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:28:32,011][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:28:32,518][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:28:33,035][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:28:33,541][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:28:34,051][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:28:34,557][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:28:35,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:28:35,571][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:28:36,079][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:28:36,583][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:28:37,088][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10097 tokens. [2025-11-13 06:28:37,966][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 06:28:38,618][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:28:38,620][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:28:38,623][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:28:39,531][__main__][INFO] - Iteration 532 took 1m 2s (41.52% Gen, 57.03% Train). Generation: 25s, Training: 35s. Estimated remaining time: 43h 50m 16s. Estimated total time: 52h 9m 14s. Time estimates for 10 more iterations: 10m 25s, 100 more iterations: 1h 44m 18s, 500 more iterations: 8h 41m 32s. [2025-11-13 06:28:39,533][__main__][INFO] - Starting iteration 532. [2025-11-13 06:28:40,007][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 06:28:40,008][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:29:11,466][__main__][INFO] - Number of regex retries in iteration 532: 0 [2025-11-13 06:29:11,467][__main__][INFO] - agents played in iteration 532 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:29:12,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:29:12,306][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:29:12,329][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:29:12,351][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:29:12,352][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:29:12,353][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:29:13,210][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:29:13,684][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:29:14,192][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:29:14,695][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:29:15,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:29:15,703][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:29:16,207][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:29:16,709][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:29:17,212][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:29:17,723][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:29:18,224][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:29:18,725][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:29:19,224][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:29:19,726][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:29:20,236][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:29:20,741][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:29:21,242][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:29:21,755][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:29:22,261][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:29:22,775][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:29:23,285][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:29:23,788][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:29:24,294][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:29:24,800][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:29:25,306][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:29:25,814][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:29:26,313][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:29:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:29:27,314][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:29:27,820][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:29:28,326][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:29:28,826][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:29:29,325][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:29:29,828][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:29:30,334][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:29:30,838][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:29:31,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:29:31,851][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:29:32,360][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:29:32,864][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:29:33,380][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:29:33,882][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:29:34,388][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:29:34,893][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:29:35,396][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:29:35,901][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:29:36,404][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:29:36,909][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:29:37,416][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:29:37,923][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:29:38,431][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:29:38,937][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:29:39,446][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:29:39,970][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:29:40,479][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:29:40,987][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:29:41,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:29:42,002][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:29:42,508][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:29:43,015][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:29:44,839][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:29:45,500][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:29:46,009][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:29:46,516][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:29:47,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10124 tokens. [2025-11-13 06:29:47,956][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.29%, Block Peak % of device VRAM: 62.43%, ΔTime: 00:00:34 [2025-11-13 06:29:48,608][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:29:48,609][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:29:48,612][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:29:49,555][__main__][INFO] - Iteration 533 took 1m 9s (45.23% Gen, 53.41% Train). Generation: 31s, Training: 37s. Estimated remaining time: 49h 37m 19s. Estimated total time: 57h 57m 27s. Time estimates for 10 more iterations: 11m 35s, 100 more iterations: 1h 55m 54s, 500 more iterations: 9h 39m 34s. [2025-11-13 06:29:49,560][__main__][INFO] - Starting iteration 533. [2025-11-13 06:29:50,047][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 06:29:50,048][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:30:18,807][__main__][INFO] - Number of regex retries in iteration 533: 0 [2025-11-13 06:30:18,808][__main__][INFO] - agents played in iteration 533 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:30:19,758][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:30:19,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:30:19,808][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:30:19,831][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:30:19,831][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:30:19,833][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:30:20,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:30:21,075][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:30:21,583][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:30:22,092][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:30:22,594][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:30:23,096][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:30:23,598][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:30:24,104][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:30:24,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:30:25,112][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:30:25,613][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:30:26,128][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:30:26,632][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:30:27,135][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:30:27,644][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:30:28,144][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:30:28,647][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:30:29,150][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:30:29,651][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:30:30,164][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:30:30,673][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:30:31,192][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:30:31,699][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:30:32,210][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:30:32,715][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:30:33,218][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:30:33,720][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:30:34,220][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:30:34,720][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:30:35,221][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:30:35,721][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:30:36,222][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:30:36,723][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:30:37,223][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:30:37,730][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:30:38,232][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:30:38,738][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:30:39,243][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:30:39,748][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:30:40,258][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:30:40,765][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:30:41,271][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:30:41,793][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:30:42,301][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:30:42,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:30:43,317][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:30:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:30:44,332][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:30:44,844][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:30:45,372][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:30:45,882][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:30:46,389][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:30:46,904][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:30:47,414][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:30:47,920][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:30:48,428][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:30:48,934][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:30:49,442][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:30:49,958][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:30:50,463][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:30:50,985][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:30:51,497][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:30:52,006][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:30:52,515][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:30:53,021][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10090 tokens. [2025-11-13 06:30:53,919][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 06:30:54,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:30:54,704][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:30:54,706][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:30:55,712][__main__][INFO] - Iteration 534 took 1m 5s (43.80% Gen, 54.67% Train). Generation: 28s, Training: 35s. Estimated remaining time: 46h 22m 5s. Estimated total time: 54h 43m 19s. Time estimates for 10 more iterations: 10m 56s, 100 more iterations: 1h 49m 26s, 500 more iterations: 9h 7m 13s. [2025-11-13 06:30:55,715][__main__][INFO] - Starting iteration 534. [2025-11-13 06:30:56,209][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 06:30:56,210][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:31:21,000][__main__][INFO] - Number of regex retries in iteration 534: 0 [2025-11-13 06:31:21,000][__main__][INFO] - agents played in iteration 534 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:31:21,786][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:31:21,809][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:31:21,832][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:31:21,854][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:31:21,855][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:31:21,856][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:31:23,787][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:31:24,254][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:31:24,762][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:31:25,280][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:31:25,783][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:31:26,296][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:31:26,798][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:31:27,302][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:31:27,813][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:31:28,318][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:31:28,828][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:31:29,329][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:31:29,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:31:30,339][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:31:30,850][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:31:31,351][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:31:31,855][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:31:32,382][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:31:32,897][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:31:33,403][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:31:33,910][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:31:34,421][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:31:34,926][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:31:35,430][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:31:35,935][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:31:36,440][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:31:36,945][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:31:37,448][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:31:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:31:38,456][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:31:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:31:39,487][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:31:39,991][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:31:40,498][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:31:41,001][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:31:41,506][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:31:42,011][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:31:42,514][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:31:43,016][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:31:43,521][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:31:44,030][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:31:44,533][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:31:45,034][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:31:45,541][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:31:46,044][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:31:46,550][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:31:47,051][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:31:47,560][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:31:48,065][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:31:48,574][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:31:49,083][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:31:49,587][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:31:50,096][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:31:50,604][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:31:51,108][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:31:51,614][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:31:52,119][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:31:52,634][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:31:53,144][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:31:53,649][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:31:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:31:54,664][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:31:55,172][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:31:55,679][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:31:56,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10225 tokens. [2025-11-13 06:31:57,061][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 06:31:57,709][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:31:57,711][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:31:57,713][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:31:58,633][__main__][INFO] - Iteration 535 took 1m 2s (39.71% Gen, 58.81% Train). Generation: 24s, Training: 36s. Estimated remaining time: 43h 38m 56s. Estimated total time: 52h 1m 13s. Time estimates for 10 more iterations: 10m 24s, 100 more iterations: 1h 44m 2s, 500 more iterations: 8h 40m 12s. [2025-11-13 06:31:58,635][__main__][INFO] - Starting iteration 535. [2025-11-13 06:31:59,117][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 06:31:59,118][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:32:27,487][__main__][INFO] - Number of regex retries in iteration 535: 0 [2025-11-13 06:32:27,488][__main__][INFO] - agents played in iteration 535 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:32:28,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:32:28,458][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:32:28,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:32:28,503][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:32:28,504][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:32:28,504][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:32:29,321][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:32:29,786][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:32:30,297][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:32:30,820][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:32:31,326][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:32:31,833][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:32:32,340][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:32:32,846][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:32:33,356][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:32:33,861][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:32:34,369][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:32:34,879][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:32:35,383][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:32:35,903][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:32:36,406][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:32:36,913][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:32:37,420][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:32:37,919][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:32:38,426][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:32:38,934][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:32:39,436][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:32:39,941][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:32:40,446][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:32:40,949][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:32:41,452][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:32:41,952][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:32:42,452][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:32:42,952][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:32:43,453][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:32:43,958][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:32:44,458][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:32:44,962][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:32:45,463][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:32:45,963][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:32:46,465][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:32:46,966][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:32:47,468][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:32:47,976][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:32:48,477][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:32:48,979][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:32:49,482][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:32:49,985][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:32:50,491][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:32:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:32:51,493][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:32:51,992][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:32:52,492][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:32:52,996][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:32:53,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:32:54,003][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:32:54,520][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:32:55,026][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:32:55,541][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:32:56,055][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:32:56,571][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:32:57,081][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:32:57,587][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:32:58,096][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:32:58,602][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:32:59,110][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:32:59,632][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:33:00,137][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:33:00,644][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:33:01,153][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:33:01,658][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10141 tokens. [2025-11-13 06:33:02,578][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.58%, ΔTime: 00:00:33 [2025-11-13 06:33:03,324][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:33:03,326][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:33:03,328][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:33:04,315][__main__][INFO] - Iteration 536 took 1m 5s (43.51% Gen, 54.97% Train). Generation: 28s, Training: 35s. Estimated remaining time: 45h 56m 33s. Estimated total time: 54h 19m 56s. Time estimates for 10 more iterations: 10m 51s, 100 more iterations: 1h 48m 39s, 500 more iterations: 9h 3m 19s. [2025-11-13 06:33:04,317][__main__][INFO] - Starting iteration 536. [2025-11-13 06:33:04,807][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 06:33:04,808][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:33:23,746][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 1 y book, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:33:26,678][__main__][INFO] - Number of regex retries in iteration 536: 1 [2025-11-13 06:33:26,679][__main__][INFO] - agents played in iteration 536 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:33:27,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:33:27,498][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:33:27,522][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:33:27,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:33:27,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:33:27,546][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:33:28,396][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:33:28,861][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:33:29,375][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:33:29,883][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:33:30,387][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:33:30,894][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:33:31,458][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:33:32,867][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:33:33,362][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:33:33,875][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:33:34,385][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:33:34,897][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:33:35,407][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:33:35,917][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:33:36,431][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:33:36,939][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:33:37,452][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:33:37,961][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:33:38,465][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:33:38,977][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:33:39,481][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:33:39,989][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:33:40,491][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:33:41,000][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:33:41,508][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:33:42,016][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:33:42,523][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:33:43,029][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:33:43,541][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:33:44,063][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:33:44,568][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:33:45,073][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:33:45,583][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:33:46,085][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:33:46,588][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:33:47,096][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:33:47,600][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:33:48,102][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:33:48,609][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:33:49,112][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:33:49,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:33:50,139][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:33:50,662][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:33:51,169][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:33:51,673][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:33:52,178][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:33:52,683][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:33:53,188][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:33:53,697][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:33:54,210][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:33:54,730][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:33:55,238][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:33:55,754][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:33:56,257][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:33:56,763][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:33:57,272][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:33:57,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:33:58,288][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:33:58,801][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:33:59,317][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:33:59,830][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:34:00,339][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:34:00,849][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:34:01,359][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:34:01,868][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10146 tokens. [2025-11-13 06:34:02,752][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:34 [2025-11-13 06:34:03,434][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:34:03,436][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:34:03,438][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:34:04,429][__main__][INFO] - Iteration 537 took 59s (36.68% Gen, 61.65% Train). Generation: 21s, Training: 36s. Estimated remaining time: 41h 16m 46s. Estimated total time: 49h 41m 9s. Time estimates for 10 more iterations: 9m 56s, 100 more iterations: 1h 39m 22s, 500 more iterations: 8h 16m 51s. [2025-11-13 06:34:04,431][__main__][INFO] - Starting iteration 537. [2025-11-13 06:34:04,917][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 06:34:04,917][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:34:31,613][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:34:34,063][__main__][INFO] - Number of regex retries in iteration 537: 1 [2025-11-13 06:34:34,064][__main__][INFO] - agents played in iteration 537 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:34:34,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:34:34,979][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:34:35,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:34:35,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:34:35,029][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:34:35,030][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:34:35,806][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:34:36,272][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:34:36,782][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:34:37,292][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:34:37,800][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:34:38,307][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:34:38,813][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:34:39,332][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:34:39,838][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:34:40,361][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:34:40,865][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:34:41,377][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:34:41,881][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:34:42,389][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:34:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:34:43,402][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:34:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:34:44,416][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:34:44,919][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:34:45,428][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:34:45,932][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:34:46,436][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:34:46,944][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:34:47,450][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:34:47,969][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:34:48,490][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:34:48,996][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:34:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:34:50,005][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:34:50,505][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:34:51,007][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:34:51,508][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:34:52,009][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:34:52,511][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:34:53,013][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:34:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:34:54,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:34:54,520][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:34:55,024][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:34:55,529][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:34:56,031][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:34:56,534][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:34:57,035][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:34:57,553][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:34:58,055][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:34:58,558][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:34:59,063][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:34:59,566][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:35:00,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:35:00,573][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:35:01,076][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:35:01,585][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:35:02,090][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:35:02,593][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:35:03,102][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:35:03,610][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:35:04,119][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:35:04,625][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:35:05,125][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:35:05,643][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:35:06,147][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:35:06,664][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:35:07,166][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:35:07,670][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:35:08,180][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10114 tokens. [2025-11-13 06:35:09,057][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 06:35:09,793][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:35:09,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:35:09,796][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:35:10,790][__main__][INFO] - Iteration 538 took 1m 5s (44.25% Gen, 54.24% Train). Generation: 29s, Training: 35s. Estimated remaining time: 46h 28m 11s. Estimated total time: 54h 53m 40s. Time estimates for 10 more iterations: 10m 58s, 100 more iterations: 1h 49m 47s, 500 more iterations: 9h 8m 56s. [2025-11-13 06:35:10,792][__main__][INFO] - Starting iteration 538. [2025-11-13 06:35:11,280][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 06:35:11,281][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:35:38,742][__main__][INFO] - Number of regex retries in iteration 538: 0 [2025-11-13 06:35:38,745][__main__][INFO] - agents played in iteration 538 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:35:39,615][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:35:39,644][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:35:39,673][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:35:39,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:35:39,699][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:35:39,700][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:35:40,502][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:35:40,972][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:35:41,484][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:35:41,990][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:35:42,504][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:35:43,009][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:35:43,513][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:35:44,017][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:35:44,521][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:35:45,029][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:35:45,536][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:35:46,042][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:35:46,556][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:35:47,061][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:35:47,578][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:35:48,082][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:35:48,589][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:35:49,099][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:35:49,602][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:35:50,102][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:35:50,605][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:35:51,111][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:35:51,619][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:35:52,122][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:35:52,626][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:35:53,130][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:35:53,656][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:35:54,175][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:35:54,685][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:35:55,188][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:35:55,698][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:35:56,202][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:35:56,701][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:35:57,204][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:35:57,707][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:35:58,212][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:35:58,717][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:35:59,222][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:35:59,726][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:36:00,231][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:36:00,736][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:36:01,240][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:36:01,746][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:36:02,250][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:36:02,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:36:03,257][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:36:03,759][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:36:04,268][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:36:04,783][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:36:05,286][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:36:05,798][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:36:06,301][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:36:06,804][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:36:07,313][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:36:07,822][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:36:08,333][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:36:08,856][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:36:09,361][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:36:09,865][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:36:10,371][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:36:10,876][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:36:11,386][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:36:11,889][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:36:12,394][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:36:12,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10061 tokens. [2025-11-13 06:36:13,765][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:33 [2025-11-13 06:36:14,430][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:36:14,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:36:14,434][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:36:15,360][__main__][INFO] - Iteration 539 took 1m 4s (42.86% Gen, 55.69% Train). Generation: 27s, Training: 35s. Estimated remaining time: 44h 57m 29s. Estimated total time: 53h 24m 3s. Time estimates for 10 more iterations: 10m 40s, 100 more iterations: 1h 46m 48s, 500 more iterations: 8h 54m 0s. [2025-11-13 06:36:15,363][__main__][INFO] - Starting iteration 539. [2025-11-13 06:36:15,847][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 06:36:15,848][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:36:44,424][__main__][INFO] - Number of regex retries in iteration 539: 0 [2025-11-13 06:36:44,424][__main__][INFO] - agents played in iteration 539 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:36:45,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:36:45,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:36:45,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:36:45,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:36:45,322][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:36:45,323][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:36:46,127][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:36:46,587][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:36:47,097][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:36:47,601][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:36:48,106][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:36:48,618][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:36:49,120][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:36:49,631][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:36:50,138][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:36:50,644][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:36:51,152][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:36:51,662][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:36:52,169][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:36:52,713][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:36:53,221][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:36:53,726][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:36:54,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:36:54,747][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:36:55,260][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:36:55,764][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:36:56,266][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:36:56,788][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:36:57,293][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:36:57,801][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:36:58,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:36:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:36:59,323][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:36:59,834][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:37:00,347][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:37:00,864][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:37:01,368][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:37:01,881][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:37:02,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:37:02,899][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:37:03,410][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:37:03,913][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:37:04,419][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:37:04,928][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:37:05,432][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:37:05,936][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:37:06,442][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:37:06,948][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:37:07,458][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:37:07,961][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:37:08,482][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:37:08,983][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:37:09,487][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:37:09,986][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:37:10,487][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:37:10,997][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:37:11,499][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:37:11,999][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:37:12,505][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:37:13,016][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:37:13,520][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:37:14,028][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:37:14,535][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:37:15,038][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:37:15,542][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:37:16,044][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:37:16,558][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:37:17,064][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:37:17,573][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:37:18,079][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:37:18,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10161 tokens. [2025-11-13 06:37:19,462][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 06:37:20,209][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:37:20,210][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:37:20,212][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:37:21,215][__main__][INFO] - Iteration 540 took 1m 5s (43.72% Gen, 54.75% Train). Generation: 28s, Training: 35s. Estimated remaining time: 46h 0m 46s. Estimated total time: 54h 28m 26s. Time estimates for 10 more iterations: 10m 53s, 100 more iterations: 1h 48m 56s, 500 more iterations: 9h 4m 44s. [2025-11-13 06:37:21,217][__main__][INFO] - Starting iteration 540. [2025-11-13 06:37:21,702][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 53 and human policies 1. 
[2025-11-13 06:37:21,702][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:37:34,585][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:37:46,075][__main__][INFO] - Number of regex retries in iteration 540: 1 [2025-11-13 06:37:46,076][__main__][INFO] - agents played in iteration 540 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:37:46,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:37:46,952][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:37:46,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:37:46,997][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:37:46,998][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:37:46,999][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:37:47,830][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:37:48,294][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:37:48,803][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:37:49,310][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:37:49,818][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:37:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:37:50,823][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:37:51,325][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:37:51,827][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:37:52,332][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:37:52,836][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:37:53,341][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:37:53,849][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:37:54,352][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:37:54,860][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:37:55,365][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:37:55,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:37:56,384][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:37:56,891][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:37:57,398][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:37:57,906][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:37:58,411][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:37:58,914][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:37:59,420][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:37:59,927][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:38:00,430][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:38:00,939][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:38:01,455][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:38:01,960][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:38:02,476][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:38:02,986][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:38:03,491][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:38:04,000][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:38:04,513][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:38:05,022][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:38:05,529][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:38:06,036][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:38:06,559][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:38:07,067][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:38:07,577][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:38:08,085][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:38:08,590][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:38:09,097][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:38:09,604][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:38:10,112][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:38:10,621][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:38:11,125][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:38:11,626][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:38:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:38:12,637][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:38:13,147][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:38:13,652][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:38:14,158][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:38:14,664][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:38:15,168][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:38:15,682][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:38:16,187][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:38:16,693][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:38:17,203][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:38:17,708][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:38:18,213][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:38:18,721][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:38:19,227][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:38:19,737][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:38:20,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10115 tokens. [2025-11-13 06:38:21,139][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.42%, ΔTime: 00:00:33 [2025-11-13 06:38:21,778][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:38:21,782][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:38:21,785][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:38:23,627][__main__][INFO] - Iteration 541 took 1m 1s (39.36% Gen, 57.66% Train). Generation: 24s, Training: 35s. Estimated remaining time: 43h 7m 34s. Estimated total time: 51h 36m 16s. Time estimates for 10 more iterations: 10m 19s, 100 more iterations: 1h 43m 12s, 500 more iterations: 8h 36m 2s. [2025-11-13 06:38:23,629][__main__][INFO] - Starting iteration 541. [2025-11-13 06:38:24,128][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 06:38:24,129][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:38:50,455][__main__][INFO] - Number of regex retries in iteration 541: 0 [2025-11-13 06:38:50,455][__main__][INFO] - agents played in iteration 541 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:38:51,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:38:51,339][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:38:51,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:38:51,387][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.27%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:38:51,388][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:38:51,389][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:38:52,244][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:38:52,704][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:38:53,216][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:38:53,726][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:38:54,230][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:38:54,732][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:38:55,237][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:38:55,744][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:38:56,257][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:38:56,764][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:38:57,274][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:38:57,781][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:38:58,291][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:38:58,795][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:38:59,298][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:38:59,804][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:39:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:39:00,808][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:39:01,317][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:39:01,823][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:39:02,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:39:02,831][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:39:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:39:03,834][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:39:04,341][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:39:04,846][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:39:05,372][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:39:05,878][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:39:06,382][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:39:06,884][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:39:07,394][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:39:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:39:08,403][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:39:08,912][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:39:09,418][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:39:09,921][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:39:10,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:39:10,926][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:39:11,431][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:39:11,933][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:39:12,440][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:39:12,952][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:39:13,458][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:39:13,970][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:39:14,476][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:39:14,981][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:39:15,490][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:39:15,996][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:39:16,508][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:39:17,013][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:39:17,519][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:39:18,025][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:39:18,536][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:39:19,045][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:39:19,563][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:39:20,069][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:39:20,587][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:39:21,095][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:39:21,603][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:39:22,111][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:39:22,618][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:39:23,125][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:39:23,632][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:39:24,148][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:39:24,655][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10064 tokens. [2025-11-13 06:39:25,549][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.07%, ΔTime: 00:00:33 [2025-11-13 06:39:26,334][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:39:26,336][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:39:26,338][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:39:27,366][__main__][INFO] - Iteration 542 took 1m 3s (41.63% Gen, 56.74% Train). Generation: 26s, Training: 35s. Estimated remaining time: 44h 12m 10s. Estimated total time: 52h 41m 55s. Time estimates for 10 more iterations: 10m 32s, 100 more iterations: 1h 45m 23s, 500 more iterations: 8h 46m 59s. [2025-11-13 06:39:27,368][__main__][INFO] - Starting iteration 542. [2025-11-13 06:39:27,839][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 06:39:27,840][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:40:01,100][__main__][INFO] - Number of regex retries in iteration 542: 0 [2025-11-13 06:40:01,102][__main__][INFO] - agents played in iteration 542 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:40:02,049][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:40:02,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:40:02,095][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:40:02,119][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:40:02,119][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:40:02,120][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:40:02,932][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:40:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:40:03,920][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:40:04,432][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:40:04,941][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:40:05,447][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:40:05,959][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:40:06,468][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:40:06,982][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:40:07,491][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:40:07,999][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:40:08,514][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:40:09,021][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:40:09,526][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:40:10,034][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:40:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:40:11,054][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:40:11,560][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:40:12,063][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:40:12,574][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:40:13,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:40:13,585][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:40:14,094][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:40:14,599][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:40:15,108][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:40:15,616][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:40:16,121][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:40:16,630][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:40:17,137][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:40:17,643][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:40:18,147][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:40:18,653][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:40:19,159][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:40:19,665][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:40:20,171][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:40:20,684][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:40:21,189][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:40:21,700][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:40:22,205][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:40:22,711][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:40:23,220][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:40:23,723][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:40:24,230][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:40:24,735][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:40:25,241][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:40:25,750][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:40:26,255][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:40:26,759][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:40:27,264][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:40:27,769][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:40:28,289][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:40:28,794][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:40:29,305][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:40:29,810][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:40:30,321][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:40:30,830][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:40:31,337][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:40:31,843][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:40:32,358][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:40:32,865][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:40:33,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:40:33,890][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:40:34,396][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:40:34,904][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:40:35,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10108 tokens. [2025-11-13 06:40:36,316][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 06:40:36,952][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:40:36,954][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:40:36,956][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:40:37,928][__main__][INFO] - Iteration 543 took 1m 10s (47.45% Gen, 51.15% Train). Generation: 33s, Training: 35s. Estimated remaining time: 49h 53m 33s. Estimated total time: 58h 24m 30s. Time estimates for 10 more iterations: 11m 40s, 100 more iterations: 1h 56m 49s, 500 more iterations: 9h 44m 5s. [2025-11-13 06:40:37,931][__main__][INFO] - Starting iteration 543. [2025-11-13 06:40:38,403][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 06:40:38,403][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:41:01,885][__main__][INFO] - Number of regex retries in iteration 543: 0 [2025-11-13 06:41:01,886][__main__][INFO] - agents played in iteration 543 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:41:02,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:41:02,693][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:41:02,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:41:02,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:41:02,742][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:41:02,744][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:41:03,543][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:41:04,004][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:41:04,520][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:41:05,027][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:41:05,533][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:41:06,038][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:41:06,548][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:41:07,057][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:41:07,565][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:41:08,071][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:41:08,578][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:41:09,089][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:41:09,594][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:41:10,105][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:41:10,610][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:41:11,119][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:41:11,625][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:41:12,132][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:41:12,637][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:41:13,141][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:41:13,644][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:41:14,149][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:41:14,652][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:41:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:41:15,667][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:41:16,174][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:41:16,691][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:41:17,198][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:41:17,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:41:18,220][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:41:18,724][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:41:19,231][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:41:19,738][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:41:20,244][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:41:20,751][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:41:21,259][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:41:21,765][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:41:22,268][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:41:22,773][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:41:23,293][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:41:23,801][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:41:24,306][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:41:24,813][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:41:25,320][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:41:25,833][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:41:26,341][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:41:26,846][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:41:27,355][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:41:27,862][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:41:28,365][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:41:28,869][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:41:29,381][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:41:29,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:41:30,400][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:41:32,463][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:41:33,139][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:41:33,647][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:41:34,153][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:41:34,658][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:41:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:41:35,675][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:41:36,185][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:41:36,690][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:41:37,196][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:41:37,718][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10086 tokens. [2025-11-13 06:41:38,676][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:35 [2025-11-13 06:41:39,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:41:39,319][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:41:39,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:41:40,231][__main__][INFO] - Iteration 544 took 1m 1s (37.98% Gen, 60.55% Train). Generation: 23s, Training: 37s. Estimated remaining time: 42h 59m 29s. Estimated total time: 51h 31m 27s. Time estimates for 10 more iterations: 10m 18s, 100 more iterations: 1h 43m 2s, 500 more iterations: 8h 35m 14s. [2025-11-13 06:41:40,234][__main__][INFO] - Starting iteration 544. [2025-11-13 06:41:40,718][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 06:41:40,719][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:42:10,506][__main__][INFO] - Number of regex retries in iteration 544: 0 [2025-11-13 06:42:10,506][__main__][INFO] - agents played in iteration 544 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:42:11,363][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:42:11,390][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:42:11,415][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:42:11,438][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:42:11,439][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:42:11,440][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:42:12,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:42:12,745][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:42:13,260][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:42:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:42:14,269][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:42:14,774][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:42:15,280][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:42:15,802][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:42:16,310][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:42:16,815][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:42:17,326][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:42:17,830][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:42:18,337][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:42:18,845][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:42:19,349][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:42:19,855][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:42:20,360][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:42:20,871][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:42:21,371][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:42:21,869][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:42:22,374][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:42:22,882][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:42:23,380][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:42:23,884][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:42:24,393][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:42:24,921][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:42:25,429][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:42:25,936][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:42:26,440][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:42:26,946][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:42:27,452][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:42:27,961][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:42:28,470][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:42:28,978][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:42:29,489][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:42:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:42:30,544][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:42:31,056][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:42:31,566][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:42:32,074][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:42:32,582][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:42:33,106][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:42:33,613][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:42:34,132][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:42:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:42:35,147][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:42:35,655][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:42:36,162][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:42:36,670][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:42:37,179][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:42:37,689][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:42:38,207][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:42:38,716][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:42:39,230][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:42:39,736][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:42:40,248][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:42:40,758][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:42:41,265][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:42:41,774][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:42:42,280][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:42:42,786][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:42:43,304][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:42:43,816][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:42:44,323][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:42:44,831][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9973 tokens. [2025-11-13 06:42:45,721][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.10%, ΔTime: 00:00:33 [2025-11-13 06:42:46,493][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:42:46,495][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:42:46,497][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:42:47,610][__main__][INFO] - Iteration 545 took 1m 6s (44.53% Gen, 53.80% Train). Generation: 29s, Training: 35s. Estimated remaining time: 47h 11m 33s. Estimated total time: 55h 44m 39s. Time estimates for 10 more iterations: 11m 8s, 100 more iterations: 1h 51m 29s, 500 more iterations: 9h 17m 26s. [2025-11-13 06:42:47,613][__main__][INFO] - Starting iteration 545. [2025-11-13 06:42:48,094][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 06:42:48,095][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:43:14,865][__main__][INFO] - Number of regex retries in iteration 545: 0 [2025-11-13 06:43:14,867][__main__][INFO] - agents played in iteration 545 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:43:15,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:43:15,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:43:15,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:43:15,924][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:43:15,924][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:43:15,926][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:43:16,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:43:17,143][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:43:17,654][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:43:18,158][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:43:18,660][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:43:19,162][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:43:19,663][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:43:20,171][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:43:20,673][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:43:21,173][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:43:21,676][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:43:22,177][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:43:22,681][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:43:23,183][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:43:23,683][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:43:24,190][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:43:24,697][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:43:25,197][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:43:25,705][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:43:26,205][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:43:26,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:43:27,224][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:43:27,724][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:43:28,238][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:43:28,739][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:43:29,244][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:43:29,745][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:43:30,246][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:43:30,746][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:43:31,246][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:43:31,747][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:43:32,249][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:43:32,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:43:33,252][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:43:33,752][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:43:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:43:34,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:43:35,263][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:43:35,769][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:43:36,277][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:43:36,783][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:43:37,287][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:43:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:43:38,310][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:43:38,835][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:43:39,340][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:43:39,846][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:43:40,350][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:43:40,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:43:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:43:41,868][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:43:42,379][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:43:42,886][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:43:43,393][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:43:43,906][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:43:44,417][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:43:44,926][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:43:45,443][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:43:45,949][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:43:46,458][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:43:46,966][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:43:47,471][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:43:47,979][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:43:48,484][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:43:48,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10123 tokens. [2025-11-13 06:43:49,907][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 06:43:50,555][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:43:50,557][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:43:50,559][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:43:51,450][__main__][INFO] - Iteration 546 took 1m 3s (42.26% Gen, 56.33% Train). Generation: 26s, Training: 35s. Estimated remaining time: 44h 13m 40s. Estimated total time: 52h 47m 50s. Time estimates for 10 more iterations: 10m 33s, 100 more iterations: 1h 45m 35s, 500 more iterations: 8h 47m 58s. [2025-11-13 06:43:51,453][__main__][INFO] - Starting iteration 546. [2025-11-13 06:43:51,928][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 06:43:51,929][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:44:07,866][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:44:18,385][__main__][INFO] - Number of regex retries in iteration 546: 1 [2025-11-13 06:44:18,386][__main__][INFO] - agents played in iteration 546 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:44:19,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:44:19,212][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:44:19,237][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:44:19,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:44:19,275][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:44:19,276][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:44:20,036][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:44:20,493][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:44:21,007][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:44:21,511][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:44:22,033][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:44:22,538][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:44:23,044][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:44:23,543][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:44:24,052][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:44:24,555][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:44:25,056][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:44:25,560][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:44:26,064][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:44:26,565][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:44:27,065][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:44:27,569][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:44:28,069][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:44:28,570][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:44:29,076][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:44:29,577][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:44:30,079][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:44:30,583][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:44:31,082][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:44:31,586][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:44:32,090][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:44:32,598][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:44:33,102][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:44:33,631][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:44:34,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:44:34,659][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:44:35,163][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:44:35,668][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:44:36,171][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:44:36,678][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:44:37,179][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:44:37,683][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:44:38,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:44:38,687][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:44:39,187][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:44:39,684][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:44:40,183][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:44:40,687][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:44:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:44:41,690][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:44:42,198][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:44:42,703][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:44:43,216][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:44:43,721][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:44:44,226][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:44:44,734][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:44:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:44:45,750][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:44:46,256][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:44:46,767][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:44:47,279][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:44:47,785][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:44:48,292][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:44:48,800][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:44:49,306][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:44:49,825][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:44:50,331][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:44:50,839][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:44:51,344][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:44:51,850][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:44:52,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10059 tokens. [2025-11-13 06:44:53,261][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.40%, ΔTime: 00:00:33 [2025-11-13 06:44:54,006][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:44:54,009][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:44:54,011][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:44:55,092][__main__][INFO] - Iteration 547 took 1m 3s (41.88% Gen, 56.40% Train). Generation: 26s, Training: 35s. Estimated remaining time: 44h 3m 0s. Estimated total time: 52h 38m 14s. Time estimates for 10 more iterations: 10m 31s, 100 more iterations: 1h 45m 16s, 500 more iterations: 8h 46m 22s. [2025-11-13 06:44:55,095][__main__][INFO] - Starting iteration 547. [2025-11-13 06:44:55,660][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 06:44:55,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:45:04,290][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:45:13,897][__main__][INFO] - Number of regex retries in iteration 547: 1 [2025-11-13 06:45:13,900][__main__][INFO] - agents played in iteration 547 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:45:14,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:45:14,862][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:45:14,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:45:14,912][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:45:14,913][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:45:14,914][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:45:15,726][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:45:16,196][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:45:16,710][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:45:17,216][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:45:17,729][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:45:18,235][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:45:18,748][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:45:19,255][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:45:19,761][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:45:20,270][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:45:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:45:21,284][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:45:21,788][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:45:22,294][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:45:22,798][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:45:23,303][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:45:23,811][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:45:24,320][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:45:24,822][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:45:25,338][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:45:25,841][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:45:26,343][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:45:26,845][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:45:27,347][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:45:27,855][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:45:28,359][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:45:28,862][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:45:29,367][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:45:29,871][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:45:30,375][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:45:30,881][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:45:31,383][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:45:31,890][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:45:32,391][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:45:32,897][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:45:33,408][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:45:33,911][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:45:34,414][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:45:34,930][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:45:35,437][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:45:35,961][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:45:36,468][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:45:36,976][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:45:37,486][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:45:37,992][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:45:38,500][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:45:39,007][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:45:39,514][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:45:40,032][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:45:40,536][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:45:41,044][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:45:41,570][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:45:42,077][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:45:42,587][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:45:43,095][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:45:43,607][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:45:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:45:44,625][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:45:45,130][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:45:45,635][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:45:46,140][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:45:46,660][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:45:47,166][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:45:47,673][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:45:48,181][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10181 tokens. [2025-11-13 06:45:49,049][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.45%, ΔTime: 00:00:33 [2025-11-13 06:45:49,716][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:45:49,718][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:45:49,720][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:45:50,621][__main__][INFO] - Iteration 548 took 54s (33.18% Gen, 65.17% Train). Generation: 18s, Training: 35s. Estimated remaining time: 37h 11m 55s. Estimated total time: 45h 48m 4s. Time estimates for 10 more iterations: 9m 9s, 100 more iterations: 1h 31m 36s, 500 more iterations: 7h 38m 0s. [2025-11-13 06:45:50,623][__main__][INFO] - Starting iteration 548. [2025-11-13 06:45:51,110][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 06:45:51,112][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:46:25,044][__main__][INFO] - Number of regex retries in iteration 548: 0 [2025-11-13 06:46:25,044][__main__][INFO] - agents played in iteration 548 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:46:25,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:46:25,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:46:25,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:46:26,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:46:26,005][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:46:26,006][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:46:26,861][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:46:27,328][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:46:27,839][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:46:28,346][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:46:28,853][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:46:29,358][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:46:29,869][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:46:30,380][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:46:30,884][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:46:31,397][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:46:31,902][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:46:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:46:32,913][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:46:33,419][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:46:33,926][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:46:34,429][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:46:34,935][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:46:35,446][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:46:35,953][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:46:36,465][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:46:36,971][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:46:37,474][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:46:37,983][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:46:38,485][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:46:38,988][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:46:39,489][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:46:39,992][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:46:40,495][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:46:41,001][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:46:41,504][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:46:41,999][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:46:42,500][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:46:42,997][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:46:43,498][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:46:44,007][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:46:44,509][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:46:45,013][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:46:45,516][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:46:46,019][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:46:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:46:47,034][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:46:47,537][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:46:48,050][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:46:48,558][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:46:49,062][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:46:49,572][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:46:50,076][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:46:50,580][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:46:51,085][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:46:51,588][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:46:52,096][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:46:52,607][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:46:53,130][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:46:53,639][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:46:54,145][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:46:54,656][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:46:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:46:55,669][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:46:56,181][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:46:56,690][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:46:57,198][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:46:57,706][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:46:58,211][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:46:58,720][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:46:59,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10092 tokens. [2025-11-13 06:47:00,110][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.14%, ΔTime: 00:00:33 [2025-11-13 06:47:00,861][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:47:00,862][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:47:00,864][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:47:01,877][__main__][INFO] - Iteration 549 took 1m 10s (47.95% Gen, 50.62% Train). Generation: 33s, Training: 35s. Estimated remaining time: 50h 21m 2s. Estimated total time: 58h 58m 22s. Time estimates for 10 more iterations: 11m 47s, 100 more iterations: 1h 57m 56s, 500 more iterations: 9h 49m 43s. [2025-11-13 06:47:01,879][__main__][INFO] - Starting iteration 549. [2025-11-13 06:47:02,347][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 06:47:02,347][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:47:22,621][__main__][INFO] - Number of regex retries in iteration 549: 0 [2025-11-13 06:47:22,621][__main__][INFO] - agents played in iteration 549 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:47:23,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:47:23,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:47:23,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:47:23,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:47:23,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:47:23,542][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:47:24,288][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:47:24,746][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:47:25,251][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:47:25,750][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:47:26,251][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:47:26,750][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:47:27,253][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:47:27,766][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:47:28,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:47:28,766][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:47:29,272][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:47:29,779][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:47:30,288][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:47:30,786][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:47:31,287][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:47:31,796][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:47:32,300][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:47:32,802][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:47:33,306][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:47:33,806][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:47:34,309][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:47:34,813][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:47:35,312][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:47:35,818][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:47:36,319][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:47:36,819][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:47:37,319][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:47:37,822][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:47:38,322][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:47:38,822][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:47:39,324][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:47:39,823][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:47:40,323][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:47:40,826][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:47:41,324][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:47:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:47:42,326][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:47:42,827][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:47:43,324][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:47:43,823][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:47:44,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:47:44,828][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:47:45,333][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:47:45,840][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:47:46,345][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:47:47,632][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:47:48,391][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:47:48,892][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:47:49,397][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:47:49,900][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:47:50,405][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:47:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:47:51,417][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:47:51,922][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:47:52,431][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:47:52,934][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:47:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:47:53,949][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:47:54,456][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:47:54,963][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:47:55,475][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:47:55,983][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:47:56,498][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:47:57,006][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:47:57,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10126 tokens. [2025-11-13 06:47:58,424][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:34 [2025-11-13 06:47:59,082][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:47:59,084][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:47:59,086][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:47:59,947][__main__][INFO] - Iteration 550 took 57s (35.20% Gen, 63.31% Train). Generation: 20s, Training: 36s. Estimated remaining time: 39h 21m 43s. Estimated total time: 48h 0m 2s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 0s, 500 more iterations: 8h 0m 0s. [2025-11-13 06:47:59,949][__main__][INFO] - Starting iteration 550. [2025-11-13 06:48:00,439][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 54 and human policies 1. 
[2025-11-13 06:48:00,441][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:48:22,704][__main__][INFO] - Number of regex retries in iteration 550: 0 [2025-11-13 06:48:22,705][__main__][INFO] - agents played in iteration 550 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:48:23,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:48:23,667][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:48:23,703][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:48:23,732][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:48:23,733][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:48:23,734][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:48:24,652][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:48:25,114][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:48:25,623][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:48:26,127][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:48:26,627][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:48:27,133][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:48:27,635][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:48:28,139][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:48:28,637][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:48:29,144][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:48:29,647][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:48:30,147][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:48:30,653][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:48:31,155][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:48:31,660][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:48:32,160][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:48:32,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:48:33,163][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:48:33,664][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:48:34,166][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:48:34,682][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:48:35,185][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:48:35,690][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:48:36,194][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:48:36,706][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:48:37,213][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:48:37,719][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:48:38,227][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:48:38,759][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:48:39,263][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:48:39,768][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:48:40,270][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:48:40,775][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:48:41,294][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:48:41,800][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:48:42,305][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:48:42,804][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:48:43,311][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:48:43,816][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:48:44,320][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:48:44,824][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:48:45,324][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:48:45,827][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:48:46,326][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:48:46,827][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:48:47,329][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:48:47,833][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:48:48,337][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:48:48,840][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:48:49,342][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:48:49,840][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:48:50,345][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:48:50,852][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:48:51,351][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:48:51,862][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:48:52,362][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:48:52,866][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:48:53,368][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:48:53,868][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:48:54,384][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:48:54,884][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:48:55,388][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:48:55,891][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:48:56,390][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:48:56,907][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10065 tokens. [2025-11-13 06:48:57,728][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.40%, ΔTime: 00:00:33 [2025-11-13 06:48:58,466][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:48:58,468][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:48:58,470][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:49:00,445][__main__][INFO] - Iteration 551 took 1m 0s (37.10% Gen, 59.60% Train). Generation: 22s, Training: 35s. Estimated remaining time: 41h 21m 1s. Estimated total time: 50h 0m 20s. Time estimates for 10 more iterations: 10m 0s, 100 more iterations: 1h 40m 0s, 500 more iterations: 8h 20m 3s. [2025-11-13 06:49:00,447][__main__][INFO] - Starting iteration 551. [2025-11-13 06:49:00,975][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 06:49:00,976][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:49:15,104][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:49:26,963][__main__][INFO] - Number of regex retries in iteration 551: 1 [2025-11-13 06:49:26,966][__main__][INFO] - agents played in iteration 551 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:49:27,815][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:49:27,867][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:49:27,901][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:49:27,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:49:27,932][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:49:27,934][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:49:28,785][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:49:29,249][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:49:29,759][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:49:30,263][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:49:30,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:49:31,282][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:49:31,787][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:49:32,299][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:49:32,804][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:49:33,309][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:49:33,823][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:49:34,332][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:49:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:49:35,347][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:49:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:49:36,370][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:49:36,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:49:37,382][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:49:37,887][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:49:38,388][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:49:38,898][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:49:39,403][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:49:39,903][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:49:40,409][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:49:40,909][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:49:41,411][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:49:41,916][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:49:42,418][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:49:42,924][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:49:43,431][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:49:43,937][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:49:44,439][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:49:44,940][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:49:45,442][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:49:45,942][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:49:46,444][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:49:46,976][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:49:47,481][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:49:47,994][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:49:48,497][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:49:49,004][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:49:49,507][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:49:50,013][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:49:50,514][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:49:51,016][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:49:51,522][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:49:52,027][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:49:52,534][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:49:53,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:49:53,543][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:49:54,048][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:49:54,555][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:49:55,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:49:55,573][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:49:56,094][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:49:56,601][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:49:57,112][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:49:57,618][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:49:58,124][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:49:58,631][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:49:59,138][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:49:59,647][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:50:00,155][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:50:00,661][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:50:01,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10127 tokens. [2025-11-13 06:50:02,048][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 06:50:02,702][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:50:02,704][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:50:02,706][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:50:03,610][__main__][INFO] - Iteration 552 took 1m 2s (41.49% Gen, 57.06% Train). Generation: 25s, Training: 35s. Estimated remaining time: 43h 31m 23s. Estimated total time: 52h 11m 45s. Time estimates for 10 more iterations: 10m 26s, 100 more iterations: 1h 44m 23s, 500 more iterations: 8h 41m 57s. [2025-11-13 06:50:03,613][__main__][INFO] - Starting iteration 552. [2025-11-13 06:50:04,097][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 06:50:04,097][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:50:32,606][__main__][INFO] - Number of regex retries in iteration 552: 0 [2025-11-13 06:50:32,607][__main__][INFO] - agents played in iteration 552 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:50:33,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:50:33,478][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:50:33,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:50:33,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:50:33,527][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:50:33,528][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:50:34,296][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:50:34,758][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:50:35,268][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:50:35,772][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:50:36,279][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:50:36,784][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:50:37,287][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:50:37,788][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:50:38,289][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:50:38,789][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:50:39,299][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:50:39,800][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:50:40,300][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:50:40,808][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:50:41,309][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:50:41,822][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:50:42,321][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:50:42,823][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:50:43,332][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:50:43,835][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:50:44,341][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:50:44,846][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:50:45,347][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:50:45,850][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:50:46,355][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:50:46,858][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:50:47,362][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:50:47,862][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:50:48,358][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:50:48,864][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:50:49,363][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:50:49,860][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:50:50,360][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:50:50,861][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:50:51,367][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:50:51,866][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:50:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:50:52,867][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:50:53,372][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:50:53,873][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:50:54,375][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:50:54,880][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:50:55,400][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:50:55,906][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:50:56,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:50:56,908][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:50:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:50:57,924][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:50:58,432][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:50:58,939][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:50:59,439][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:50:59,944][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:51:00,451][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:51:00,958][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:51:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:51:01,969][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:51:02,474][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:51:02,980][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:51:03,485][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:51:03,993][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:51:04,510][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:51:05,015][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:51:05,518][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:51:06,023][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:51:06,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10048 tokens. [2025-11-13 06:51:07,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 06:51:08,149][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:51:08,151][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:51:08,152][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:51:09,124][__main__][INFO] - Iteration 553 took 1m 5s (43.84% Gen, 54.66% Train). Generation: 28s, Training: 35s. Estimated remaining time: 45h 29m 56s. Estimated total time: 54h 11m 24s. Time estimates for 10 more iterations: 10m 50s, 100 more iterations: 1h 48m 22s, 500 more iterations: 9h 1m 54s. [2025-11-13 06:51:09,126][__main__][INFO] - Starting iteration 553. [2025-11-13 06:51:09,609][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 06:51:09,610][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:51:35,723][__main__][INFO] - Number of regex retries in iteration 553: 0 [2025-11-13 06:51:35,726][__main__][INFO] - agents played in iteration 553 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:51:36,700][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:51:36,728][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:51:36,753][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:51:36,775][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:51:36,776][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:51:36,777][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:51:37,573][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:51:38,030][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:51:38,537][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:51:39,038][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:51:39,533][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:51:40,044][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:51:40,544][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:51:41,051][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:51:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:51:42,058][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:51:42,562][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:51:43,072][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:51:43,575][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:51:44,079][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:51:44,583][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:51:45,084][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:51:45,589][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:51:46,093][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:51:46,605][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:51:47,105][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:51:47,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:51:48,118][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:51:48,618][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:51:49,131][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:51:49,630][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:51:50,131][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:51:50,640][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:51:51,142][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:51:51,650][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:51:52,153][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:51:52,654][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:51:53,160][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:51:53,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:51:54,167][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:51:54,668][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:51:55,167][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:51:55,665][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:51:56,164][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:51:56,666][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:51:57,172][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:51:57,674][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:51:58,177][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:51:58,684][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:51:59,188][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:51:59,693][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:52:00,228][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:52:00,731][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:52:01,235][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:52:01,739][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:52:02,249][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:52:02,757][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:52:03,264][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:52:03,770][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:52:04,275][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:52:04,780][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:52:05,287][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:52:05,791][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:52:06,302][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:52:06,807][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:52:07,317][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:52:07,826][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:52:08,331][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:52:08,837][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:52:09,343][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:52:09,848][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10035 tokens. [2025-11-13 06:52:10,729][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:33 [2025-11-13 06:52:11,385][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:52:11,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:52:11,388][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:52:12,302][__main__][INFO] - Iteration 554 took 1m 2s (41.66% Gen, 56.88% Train). Generation: 26s, Training: 35s. Estimated remaining time: 43h 32m 11s. Estimated total time: 52h 14m 42s. Time estimates for 10 more iterations: 10m 26s, 100 more iterations: 1h 44m 29s, 500 more iterations: 8h 42m 27s. [2025-11-13 06:52:12,305][__main__][INFO] - Starting iteration 554. [2025-11-13 06:52:12,784][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 06:52:12,785][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:52:44,395][__main__][INFO] - Number of regex retries in iteration 554: 0 [2025-11-13 06:52:44,396][__main__][INFO] - agents played in iteration 554 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:52:45,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:52:45,287][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:52:45,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:52:45,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:52:45,336][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:52:45,337][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:52:46,189][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:52:46,666][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:52:47,173][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:52:47,682][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:52:48,184][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:52:48,694][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:52:49,192][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:52:49,698][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:52:50,203][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:52:50,705][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:52:51,209][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:52:51,711][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:52:52,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:52:52,714][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:52:53,220][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:52:53,722][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:52:54,227][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:52:54,728][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:52:55,232][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:52:55,734][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:52:56,240][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:52:56,742][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:52:57,244][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:52:57,764][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:52:58,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:52:58,781][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:52:59,284][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:52:59,786][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:53:00,291][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:53:00,798][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:53:01,301][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:53:01,806][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:53:02,308][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:53:02,812][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:53:03,315][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:53:03,818][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:53:04,326][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:53:04,841][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:53:05,348][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:53:05,859][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:53:06,368][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:53:06,873][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:53:07,376][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:53:07,885][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:53:08,390][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:53:08,894][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:53:09,399][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:53:09,906][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:53:10,410][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:53:10,921][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:53:11,429][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:53:11,935][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:53:12,448][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:53:12,955][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:53:13,469][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:53:13,973][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:53:14,483][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:53:14,989][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:53:15,500][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:53:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:53:16,515][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:53:17,019][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:53:17,534][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:53:18,037][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:53:18,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10103 tokens. [2025-11-13 06:53:19,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 06:53:20,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:53:20,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:53:20,202][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:53:21,196][__main__][INFO] - Iteration 555 took 1m 8s (46.21% Gen, 52.34% Train). Generation: 31s, Training: 35s. Estimated remaining time: 48h 16m 57s. Estimated total time: 57h 0m 37s. Time estimates for 10 more iterations: 11m 24s, 100 more iterations: 1h 54m 1s, 500 more iterations: 9h 30m 6s. [2025-11-13 06:53:21,199][__main__][INFO] - Starting iteration 555. [2025-11-13 06:53:21,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 06:53:21,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:53:44,334][__main__][INFO] - Number of regex retries in iteration 555: 0 [2025-11-13 06:53:44,334][__main__][INFO] - agents played in iteration 555 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:53:45,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:53:45,143][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:53:45,165][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:53:45,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:53:45,187][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:53:45,188][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:53:45,929][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:53:46,389][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:53:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:53:47,391][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:53:47,892][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:53:48,397][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:53:48,895][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:53:49,394][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:53:49,897][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:53:50,405][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:53:50,908][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:53:51,418][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:53:51,922][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:53:52,425][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:53:52,929][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:53:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:53:53,938][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:53:54,453][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:53:54,960][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:53:55,460][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:53:55,961][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:53:56,464][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:53:56,974][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:53:57,483][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:53:57,985][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:53:59,400][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:53:59,911][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:54:00,419][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:54:00,929][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:54:01,437][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:54:01,944][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:54:02,453][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:54:02,952][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:54:03,461][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:54:03,968][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:54:04,475][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:54:04,986][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:54:05,492][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:54:06,002][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:54:06,511][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:54:07,020][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:54:07,526][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:54:08,033][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:54:08,539][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:54:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:54:09,556][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:54:10,070][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:54:10,578][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:54:11,086][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:54:11,592][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:54:12,101][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:54:12,609][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:54:13,117][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:54:13,626][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:54:14,136][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:54:14,645][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:54:15,151][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:54:15,660][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:54:16,169][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:54:16,681][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:54:17,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:54:17,694][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:54:18,201][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:54:18,707][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:54:19,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10250 tokens. [2025-11-13 06:54:20,104][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:34 [2025-11-13 06:54:20,757][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:54:20,759][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:54:20,760][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:54:21,652][__main__][INFO] - Iteration 556 took 59s (37.77% Gen, 60.74% Train). Generation: 22s, Training: 36s. Estimated remaining time: 41h 13m 41s. Estimated total time: 49h 58m 21s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 56s, 500 more iterations: 8h 19m 43s. [2025-11-13 06:54:21,654][__main__][INFO] - Starting iteration 556. [2025-11-13 06:54:22,135][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 06:54:22,136][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:54:53,976][__main__][INFO] - Number of regex retries in iteration 556: 0 [2025-11-13 06:54:53,977][__main__][INFO] - agents played in iteration 556 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:54:54,823][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:54:54,851][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:54:54,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:54:54,899][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:54:54,900][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:54:54,901][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:54:55,659][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:54:56,152][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:54:56,661][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:54:57,162][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:54:57,667][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:54:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:54:58,674][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:54:59,184][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:54:59,696][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:55:00,218][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:55:00,730][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:55:01,243][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:55:01,750][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:55:02,252][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:55:02,768][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:55:03,269][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:55:03,771][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:55:04,273][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:55:04,779][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:55:05,282][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:55:05,784][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:55:06,290][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:55:06,800][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:55:07,306][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:55:07,810][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:55:08,315][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:55:08,818][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:55:09,322][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:55:09,825][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:55:10,328][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:55:10,835][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:55:11,340][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:55:11,852][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:55:12,354][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:55:12,861][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:55:13,366][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:55:13,872][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:55:14,381][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:55:14,885][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:55:15,389][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:55:15,897][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:55:16,399][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:55:16,905][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:55:17,411][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:55:17,918][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:55:18,427][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:55:18,931][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:55:19,436][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:55:19,939][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:55:20,449][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:55:20,957][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:55:21,465][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:55:21,974][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:55:22,489][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:55:22,999][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:55:23,508][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:55:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:55:24,524][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:55:25,037][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:55:25,546][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:55:26,055][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:55:26,559][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:55:27,064][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:55:27,569][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:55:28,075][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10100 tokens. [2025-11-13 06:55:28,949][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:33 [2025-11-13 06:55:29,662][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:55:29,664][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:55:29,665][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:55:30,647][__main__][INFO] - Iteration 557 took 1m 8s (46.47% Gen, 52.09% Train). Generation: 31s, Training: 35s. Estimated remaining time: 48h 19m 48s. Estimated total time: 57h 5m 37s. Time estimates for 10 more iterations: 11m 25s, 100 more iterations: 1h 54m 11s, 500 more iterations: 9h 30m 56s. [2025-11-13 06:55:30,649][__main__][INFO] - Starting iteration 557. [2025-11-13 06:55:31,123][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 06:55:31,124][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:55:46,423][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:55:57,108][__main__][INFO] - Number of regex retries in iteration 557: 1 [2025-11-13 06:55:57,110][__main__][INFO] - agents played in iteration 557 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:55:58,025][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:55:58,072][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:55:58,107][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:55:58,137][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:55:58,138][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:55:58,138][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:55:58,978][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:55:59,450][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:55:59,959][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:56:00,465][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:56:00,973][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:56:01,477][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:56:01,990][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:56:02,496][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:56:03,008][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:56:03,515][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:56:04,028][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:56:04,536][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:56:05,044][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:56:05,549][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:56:06,051][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:56:06,559][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:56:07,065][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:56:07,571][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:56:08,074][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:56:08,582][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:56:09,087][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:56:09,593][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:56:10,109][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:56:10,614][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:56:11,119][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:56:11,628][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:56:12,131][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:56:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:56:13,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:56:13,649][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:56:14,160][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:56:14,665][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:56:15,168][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:56:15,678][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:56:16,185][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:56:16,705][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:56:17,212][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:56:17,716][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:56:18,221][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:56:18,728][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:56:19,235][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:56:19,740][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:56:20,276][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:56:20,787][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:56:21,292][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:56:21,804][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:56:22,310][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:56:22,820][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:56:23,329][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:56:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:56:24,342][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:56:24,849][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:56:25,353][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:56:25,861][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:56:26,367][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:56:26,872][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:56:27,376][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:56:27,883][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:56:28,404][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:56:28,911][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:56:29,419][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:56:29,928][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:56:30,436][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:56:30,944][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:56:31,448][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10130 tokens. [2025-11-13 06:56:32,339][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 06:56:32,972][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:56:32,974][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:56:32,976][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:56:33,863][__main__][INFO] - Iteration 558 took 1m 2s (41.42% Gen, 57.17% Train). Generation: 25s, Training: 35s. Estimated remaining time: 43h 30m 10s. Estimated total time: 52h 17m 2s. Time estimates for 10 more iterations: 10m 27s, 100 more iterations: 1h 44m 34s, 500 more iterations: 8h 42m 50s. [2025-11-13 06:56:33,866][__main__][INFO] - Starting iteration 558. [2025-11-13 06:56:34,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 06:56:34,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:57:08,240][__main__][INFO] - Number of regex retries in iteration 558: 0 [2025-11-13 06:57:08,240][__main__][INFO] - agents played in iteration 558 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:57:09,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:57:09,053][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:57:09,081][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:57:09,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:57:09,107][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:57:09,108][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:57:09,878][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:57:10,337][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:57:10,850][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:57:11,355][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:57:11,857][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:57:12,360][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:57:12,865][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:57:13,367][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:57:13,867][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:57:14,366][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:57:14,869][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:57:15,373][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:57:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:57:16,393][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:57:16,893][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:57:17,399][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:57:17,901][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:57:18,404][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:57:18,904][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:57:19,405][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:57:19,916][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:57:20,416][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:57:20,921][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:57:21,425][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:57:21,926][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:57:22,429][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:57:22,931][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:57:23,438][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:57:23,946][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:57:24,453][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:57:24,961][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:57:25,471][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:57:25,975][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:57:26,483][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:57:26,989][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:57:27,497][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:57:28,006][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:57:28,523][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:57:29,032][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:57:29,547][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:57:30,052][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:57:30,559][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:57:31,066][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:57:31,595][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:57:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:57:32,610][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:57:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:57:33,630][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:57:34,139][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:57:34,644][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:57:35,150][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:57:35,659][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:57:36,167][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:57:36,699][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:57:38,696][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:57:39,216][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:57:39,722][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:57:40,227][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:57:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:57:41,238][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:57:41,742][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:57:42,249][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:57:42,753][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:57:43,264][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:57:43,769][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10097 tokens. [2025-11-13 06:57:44,689][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:34 [2025-11-13 06:57:45,332][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:57:45,333][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:57:45,335][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:57:46,248][__main__][INFO] - Iteration 559 took 1m 11s (47.14% Gen, 51.59% Train). Generation: 33s, Training: 37s. Estimated remaining time: 51h 7m 12s. Estimated total time: 59h 55m 16s. Time estimates for 10 more iterations: 11m 59s, 100 more iterations: 1h 59m 50s, 500 more iterations: 9h 59m 12s. [2025-11-13 06:57:46,252][__main__][INFO] - Starting iteration 559. [2025-11-13 06:57:46,732][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 06:57:46,732][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:58:08,017][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 06:58:18,464][__main__][INFO] - Number of regex retries in iteration 559: 1 [2025-11-13 06:58:18,465][__main__][INFO] - agents played in iteration 559 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:58:19,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:58:19,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:58:19,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:58:19,416][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:58:19,417][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:58:19,418][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:58:20,239][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:58:20,695][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:58:21,213][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:58:21,715][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:58:22,216][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:58:22,727][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:58:23,227][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:58:23,733][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:58:24,234][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:58:24,734][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:58:25,243][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:58:25,746][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:58:26,253][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:58:26,757][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:58:27,263][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:58:27,768][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:58:28,274][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:58:28,781][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:58:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:58:29,818][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:58:30,321][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:58:30,824][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:58:31,326][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:58:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:58:32,336][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:58:32,838][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:58:33,345][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:58:33,850][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:58:34,354][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:58:34,859][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:58:35,364][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:58:35,870][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:58:36,378][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:58:36,909][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:58:37,433][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:58:37,938][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:58:38,446][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:58:38,955][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:58:39,465][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:58:39,969][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:58:40,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:58:40,981][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:58:41,488][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:58:41,994][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:58:42,499][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:58:43,007][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:58:43,515][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:58:44,038][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:58:44,548][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:58:45,052][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:58:45,556][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:58:46,061][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:58:46,568][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:58:47,073][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:58:47,583][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:58:48,094][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:58:48,602][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:58:49,107][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:58:49,612][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:58:50,119][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 06:58:50,636][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 06:58:51,143][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 06:58:51,647][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
06:58:52,154][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 06:58:52,662][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10022 tokens. [2025-11-13 06:58:53,565][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.10%, ΔTime: 00:00:33 [2025-11-13 06:58:54,344][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 06:58:54,346][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 06:58:54,349][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 06:58:55,320][__main__][INFO] - Iteration 560 took 1m 8s (46.26% Gen, 52.32% Train). Generation: 31s, Training: 35s. Estimated remaining time: 48h 20m 13s. Estimated total time: 57h 9m 27s. Time estimates for 10 more iterations: 11m 25s, 100 more iterations: 1h 54m 18s, 500 more iterations: 9h 31m 34s. [2025-11-13 06:58:55,322][__main__][INFO] - Starting iteration 560. [2025-11-13 06:58:55,909][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 55 and human policies 1. 
[2025-11-13 06:58:55,910][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 06:59:27,108][__main__][INFO] - Number of regex retries in iteration 560: 0 [2025-11-13 06:59:27,109][__main__][INFO] - agents played in iteration 560 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 06:59:27,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:59:27,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:59:28,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:59:28,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 06:59:28,038][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 06:59:28,038][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 06:59:28,844][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 06:59:29,303][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 06:59:29,813][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 06:59:30,327][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 06:59:30,834][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 06:59:31,342][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 06:59:31,845][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 06:59:32,349][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 06:59:32,859][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 06:59:33,361][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 06:59:33,864][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 06:59:34,366][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 06:59:34,871][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 06:59:35,375][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 06:59:35,877][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 06:59:36,387][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 06:59:36,900][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 06:59:37,403][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 06:59:37,927][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 06:59:38,432][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 06:59:38,937][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
06:59:39,444][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 06:59:39,950][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 06:59:40,459][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 06:59:40,966][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 06:59:41,473][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 06:59:41,978][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 06:59:42,484][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 06:59:42,989][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 06:59:43,498][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 06:59:44,007][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 06:59:44,527][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 06:59:45,037][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 06:59:45,544][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 06:59:46,046][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 06:59:46,553][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 06:59:47,063][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 06:59:47,567][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 06:59:48,072][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 06:59:48,585][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 06:59:50,318][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 06:59:50,815][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
06:59:51,329][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 06:59:51,837][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 06:59:52,344][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 06:59:52,850][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 06:59:53,354][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 06:59:53,860][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 06:59:54,363][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 06:59:54,868][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 06:59:55,375][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 06:59:55,882][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 06:59:56,394][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 06:59:56,903][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 06:59:57,412][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 06:59:57,926][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 06:59:58,434][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 06:59:58,942][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 06:59:59,447][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 06:59:59,950][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:00:00,460][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:00:00,964][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:00:01,467][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:00:01,969][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:00:02,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10141 tokens. [2025-11-13 07:00:03,386][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:34 [2025-11-13 07:00:04,044][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:00:04,046][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:00:04,048][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:00:05,938][__main__][INFO] - Iteration 561 took 1m 10s (44.55% Gen, 52.75% Train). Generation: 31s, Training: 36s. Estimated remaining time: 49h 31m 4s. Estimated total time: 58h 21m 28s. Time estimates for 10 more iterations: 11m 40s, 100 more iterations: 1h 56m 42s, 500 more iterations: 9h 43m 34s. [2025-11-13 07:00:05,941][__main__][INFO] - Starting iteration 561. [2025-11-13 07:00:06,430][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 07:00:06,431][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:00:41,291][__main__][INFO] - Number of regex retries in iteration 561: 0 [2025-11-13 07:00:41,291][__main__][INFO] - agents played in iteration 561 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:00:42,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:00:42,169][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:00:42,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:00:42,216][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:00:42,216][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:00:42,217][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:00:43,079][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:00:43,542][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:00:44,063][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:00:44,570][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:00:45,077][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:00:45,592][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:00:46,101][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:00:46,611][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:00:47,117][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:00:47,624][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:00:48,143][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:00:48,653][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:00:49,163][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:00:49,669][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:00:50,182][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:00:50,690][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:00:51,201][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:00:51,704][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:00:52,213][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:00:52,716][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:00:53,224][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:00:53,729][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:00:54,231][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:00:54,748][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:00:55,256][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:00:55,762][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:00:56,265][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:00:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:00:57,286][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:00:57,798][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:00:58,303][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:00:58,811][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:00:59,322][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:00:59,824][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:01:00,328][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:01:00,837][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:01:01,355][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:01:01,859][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:01:02,364][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:01:02,871][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:01:03,379][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:01:03,887][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:01:04,392][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:01:04,901][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:01:05,414][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:01:05,922][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:01:06,428][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:01:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:01:07,441][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:01:07,961][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:01:08,467][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:01:08,974][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:01:09,477][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:01:09,988][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:01:10,496][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:01:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:01:11,504][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:01:12,008][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:01:12,540][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:01:13,054][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:01:13,560][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:01:14,068][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:01:14,577][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:01:15,080][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:01:15,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10190 tokens. [2025-11-13 07:01:16,394][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 07:01:17,143][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:01:17,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:01:17,146][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:01:18,146][__main__][INFO] - Iteration 562 took 1m 11s (48.61% Gen, 50.00% Train). Generation: 34s, Training: 35s. Estimated remaining time: 50h 54m 14s. Estimated total time: 59h 45m 51s. Time estimates for 10 more iterations: 11m 57s, 100 more iterations: 1h 59m 31s, 500 more iterations: 9h 57m 38s. [2025-11-13 07:01:18,149][__main__][INFO] - Starting iteration 562. [2025-11-13 07:01:18,632][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 07:01:18,633][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:01:39,112][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:01:45,432][__main__][INFO] - Number of regex retries in iteration 562: 1 [2025-11-13 07:01:45,432][__main__][INFO] - agents played in iteration 562 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:01:46,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:01:46,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:01:46,261][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:01:46,284][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:01:46,285][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:01:46,286][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:01:47,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:01:47,492][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:01:48,004][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:01:48,516][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:01:49,023][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:01:49,529][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:01:50,034][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:01:50,534][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:01:51,042][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:01:51,548][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:01:52,052][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:01:52,570][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:01:53,074][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:01:53,583][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:01:54,087][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:01:54,592][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:01:55,099][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:01:55,603][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:01:56,108][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:01:56,612][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:01:57,114][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:01:57,622][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:01:58,129][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:01:58,638][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:02:00,011][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:02:01,607][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:02:02,120][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:02:02,629][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:02:03,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:02:03,651][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:02:04,158][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:02:04,666][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:02:05,180][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:02:05,692][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:02:06,198][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:02:06,705][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:02:07,210][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:02:07,720][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:02:08,225][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:02:08,729][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:02:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:02:09,749][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:02:10,252][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:02:10,757][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:02:11,262][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:02:11,768][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:02:12,287][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:02:12,791][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:02:13,299][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:02:13,805][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:02:14,315][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:02:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:02:15,333][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:02:15,837][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:02:16,348][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:02:16,852][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:02:17,357][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:02:17,863][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:02:18,368][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:02:18,890][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:02:19,394][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:02:19,902][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:02:20,408][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:02:20,914][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:02:21,422][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10118 tokens. [2025-11-13 07:02:22,332][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.40%, ΔTime: 00:00:35 [2025-11-13 07:02:22,975][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:02:22,977][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:02:22,979][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:02:23,876][__main__][INFO] - Iteration 563 took 1m 5s (41.08% Gen, 57.55% Train). Generation: 26s, Training: 37s. Estimated remaining time: 45h 29m 29s. Estimated total time: 54h 22m 11s. Time estimates for 10 more iterations: 10m 52s, 100 more iterations: 1h 48m 44s, 500 more iterations: 9h 3m 41s. [2025-11-13 07:02:23,878][__main__][INFO] - Starting iteration 563. [2025-11-13 07:02:24,375][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 07:02:24,376][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:02:55,502][__main__][INFO] - Number of regex retries in iteration 563: 0 [2025-11-13 07:02:55,503][__main__][INFO] - agents played in iteration 563 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:02:56,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:02:56,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:02:56,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:02:56,558][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:02:56,558][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:02:56,559][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:02:57,412][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:02:57,874][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:02:58,395][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:02:58,900][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:02:59,411][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:02:59,922][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:03:00,432][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:03:00,938][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:03:01,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:03:01,948][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:03:02,451][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:03:02,956][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:03:03,472][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:03:03,975][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:03:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:03:04,987][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:03:05,500][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:03:06,011][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:03:06,519][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:03:07,027][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:03:07,534][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:03:08,043][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:03:08,557][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:03:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:03:09,578][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:03:10,102][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:03:10,612][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:03:11,118][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:03:11,623][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:03:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:03:12,636][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:03:13,141][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:03:13,649][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:03:14,166][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:03:14,667][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:03:15,171][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:03:15,677][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:03:16,182][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:03:16,687][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:03:17,190][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:03:17,700][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:03:18,205][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:03:18,710][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:03:19,217][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:03:19,723][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:03:20,228][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:03:20,735][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:03:21,242][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:03:21,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:03:22,258][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:03:22,763][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:03:23,289][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:03:23,792][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:03:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:03:24,802][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:03:25,313][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:03:25,821][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:03:26,331][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:03:26,836][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:03:27,341][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:03:27,850][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:03:28,382][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:03:28,887][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:03:29,394][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:03:29,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10103 tokens. [2025-11-13 07:03:30,696][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 07:03:31,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:03:31,417][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:03:31,419][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:03:32,438][__main__][INFO] - Iteration 564 took 1m 8s (45.73% Gen, 52.77% Train). Generation: 31s, Training: 35s. Estimated remaining time: 47h 49m 20s. Estimated total time: 56h 43m 11s. Time estimates for 10 more iterations: 11m 20s, 100 more iterations: 1h 53m 26s, 500 more iterations: 9h 27m 11s. [2025-11-13 07:03:32,440][__main__][INFO] - Starting iteration 564. [2025-11-13 07:03:32,926][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 07:03:32,927][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:03:52,950][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:04:01,345][__main__][INFO] - Number of regex retries in iteration 564: 1 [2025-11-13 07:04:01,346][__main__][INFO] - agents played in iteration 564 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:04:02,183][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:04:02,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:04:02,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:04:02,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:04:02,279][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:04:02,281][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:04:03,141][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:04:03,603][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:04:04,117][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:04:04,625][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:04:05,130][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:04:05,636][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:04:06,142][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:04:06,648][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:04:07,153][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:04:07,666][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:04:08,190][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:04:08,701][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:04:09,209][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:04:09,717][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:04:10,221][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:04:10,726][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:04:11,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:04:11,739][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:04:12,250][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:04:12,757][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:04:13,264][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:04:13,770][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:04:14,273][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:04:14,793][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:04:15,299][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:04:15,808][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:04:16,316][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:04:16,827][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:04:17,340][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:04:17,849][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:04:18,355][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:04:18,864][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:04:19,371][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:04:19,889][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:04:20,394][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:04:20,905][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:04:21,411][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:04:21,916][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:04:22,424][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:04:22,931][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:04:23,437][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:04:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:04:24,455][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:04:24,960][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:04:25,468][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:04:25,971][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:04:26,476][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:04:26,980][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:04:27,480][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:04:27,988][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:04:28,495][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:04:29,010][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:04:29,518][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:04:30,022][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:04:30,528][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:04:31,035][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:04:31,546][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:04:32,053][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:04:32,559][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:04:33,063][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:04:33,572][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:04:34,084][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:04:34,589][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:04:35,093][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:04:35,601][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10155 tokens. [2025-11-13 07:04:36,488][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33 [2025-11-13 07:04:37,120][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:04:37,122][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:04:37,123][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:04:38,044][__main__][INFO] - Iteration 565 took 1m 5s (43.64% Gen, 54.94% Train). Generation: 28s, Training: 35s. Estimated remaining time: 45h 20m 58s. Estimated total time: 54h 15m 54s. Time estimates for 10 more iterations: 10m 51s, 100 more iterations: 1h 48m 31s, 500 more iterations: 9h 2m 39s. [2025-11-13 07:04:38,046][__main__][INFO] - Starting iteration 565. [2025-11-13 07:04:38,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 07:04:38,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:05:05,112][__main__][INFO] - Number of regex retries in iteration 565: 0 [2025-11-13 07:05:05,113][__main__][INFO] - agents played in iteration 565 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:05:05,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:05:05,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:05:05,953][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:05:05,975][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:05:05,975][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:05:05,976][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:05:06,740][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:05:07,208][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:05:07,716][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:05:08,225][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:05:08,726][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:05:09,228][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:05:09,731][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:05:10,234][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:05:10,740][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:05:11,245][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:05:11,749][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:05:12,255][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:05:12,760][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:05:13,263][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:05:13,770][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:05:14,274][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:05:14,775][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:05:15,289][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:05:15,795][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:05:16,308][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:05:16,815][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:05:17,322][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:05:17,828][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:05:18,326][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:05:18,826][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:05:19,330][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:05:19,838][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:05:20,343][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:05:20,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:05:21,354][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:05:21,861][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:05:22,364][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:05:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:05:23,371][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:05:23,875][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:05:24,389][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:05:24,897][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:05:25,415][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:05:25,921][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:05:26,426][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:05:26,936][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:05:27,446][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:05:27,955][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:05:28,464][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:05:28,971][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:05:29,485][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:05:29,991][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:05:30,502][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:05:31,011][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:05:31,522][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:05:32,033][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:05:32,545][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:05:33,054][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:05:33,563][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:05:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:05:34,586][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:05:35,093][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:05:35,604][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:05:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:05:36,624][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:05:37,131][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:05:37,636][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:05:38,141][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:05:38,648][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:05:39,160][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10043 tokens. [2025-11-13 07:05:40,053][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 07:05:40,815][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:05:40,817][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:05:40,819][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:05:41,823][__main__][INFO] - Iteration 566 took 1m 3s (41.99% Gen, 56.42% Train). Generation: 26s, Training: 35s. Estimated remaining time: 43h 48m 20s. Estimated total time: 52h 44m 20s. Time estimates for 10 more iterations: 10m 32s, 100 more iterations: 1h 45m 28s, 500 more iterations: 8h 47m 23s. [2025-11-13 07:05:41,825][__main__][INFO] - Starting iteration 566. [2025-11-13 07:05:42,303][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 07:05:42,304][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:06:10,047][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:06:13,275][__main__][INFO] - Number of regex retries in iteration 566: 1 [2025-11-13 07:06:13,275][__main__][INFO] - agents played in iteration 566 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:06:14,263][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:06:14,288][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:06:14,312][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:06:14,334][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:06:14,335][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:06:14,336][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:06:15,123][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:06:15,596][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:06:16,107][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:06:16,614][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:06:17,119][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:06:17,623][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:06:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:06:18,639][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:06:19,146][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:06:19,651][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:06:20,155][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:06:20,670][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:06:21,181][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:06:21,685][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:06:22,204][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:06:22,713][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:06:23,226][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:06:23,732][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:06:24,239][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:06:24,745][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:06:25,261][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:06:25,788][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:06:26,297][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:06:26,804][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:06:27,328][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:06:27,833][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:06:28,336][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:06:28,839][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:06:29,343][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:06:29,851][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:06:30,355][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:06:30,857][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:06:31,367][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:06:31,874][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:06:32,382][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:06:32,888][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:06:33,392][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:06:33,912][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:06:34,417][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:06:34,930][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:06:35,434][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:06:35,942][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:06:36,448][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:06:36,956][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:06:37,465][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:06:37,970][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:06:38,474][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:06:38,980][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:06:39,485][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:06:39,989][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:06:40,506][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:06:41,014][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:06:41,540][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:06:42,044][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:06:42,552][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:06:43,056][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:06:43,565][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:06:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:06:44,580][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:06:45,086][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:06:45,610][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:06:46,111][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:06:46,613][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:06:47,117][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:06:47,622][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10101 tokens. [2025-11-13 07:06:48,438][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 07:06:49,071][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:06:49,073][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:06:49,075][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:06:49,982][__main__][INFO] - Iteration 567 took 1m 7s (45.76% Gen, 52.90% Train). Generation: 30s, Training: 35s. Estimated remaining time: 47h 26m 50s. Estimated total time: 56h 23m 59s. Time estimates for 10 more iterations: 11m 16s, 100 more iterations: 1h 52m 47s, 500 more iterations: 9h 23m 59s. [2025-11-13 07:06:49,984][__main__][INFO] - Starting iteration 567. [2025-11-13 07:06:50,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 07:06:50,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:07:04,741][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:07:15,604][__main__][INFO] - Number of regex retries in iteration 567: 1 [2025-11-13 07:07:15,605][__main__][INFO] - agents played in iteration 567 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:07:16,408][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:07:16,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:07:16,453][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:07:16,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:07:16,476][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:07:16,477][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:07:17,344][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:07:17,811][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:07:18,318][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:07:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:07:19,334][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:07:19,854][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:07:20,363][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:07:20,867][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:07:21,373][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:07:21,873][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:07:22,382][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:07:22,891][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:07:23,399][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:07:23,915][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:07:24,425][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:07:24,939][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:07:25,451][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:07:25,958][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:07:26,469][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:07:26,978][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:07:27,482][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:07:27,988][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:07:28,493][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:07:29,000][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:07:29,508][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:07:30,013][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:07:30,532][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:07:31,038][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:07:31,553][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:07:32,078][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:07:32,582][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:07:33,094][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:07:33,603][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:07:34,109][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:07:34,624][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:07:35,128][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:07:35,650][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:07:36,154][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:07:36,659][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:07:37,166][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:07:37,670][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:07:38,185][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:07:38,692][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:07:39,205][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:07:39,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:07:40,225][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:07:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:07:41,242][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:07:41,750][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:07:42,266][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:07:42,772][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:07:43,284][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:07:43,795][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:07:44,306][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:07:44,815][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:07:45,325][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:07:45,834][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:07:46,342][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:07:46,845][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:07:47,347][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:07:47,849][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:07:48,351][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:07:48,865][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:07:49,370][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:07:49,876][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10121 tokens. [2025-11-13 07:07:50,731][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 07:07:51,459][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:07:51,461][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:07:51,462][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:07:52,452][__main__][INFO] - Iteration 568 took 1m 1s (40.56% Gen, 57.84% Train). Generation: 25s, Training: 35s. Estimated remaining time: 42h 41m 26s. Estimated total time: 51h 39m 37s. Time estimates for 10 more iterations: 10m 19s, 100 more iterations: 1h 43m 19s, 500 more iterations: 8h 36m 36s. [2025-11-13 07:07:52,454][__main__][INFO] - Starting iteration 568. [2025-11-13 07:07:52,934][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 07:07:52,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:08:23,039][__main__][INFO] - Number of regex retries in iteration 568: 0 [2025-11-13 07:08:23,041][__main__][INFO] - agents played in iteration 568 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:08:23,881][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:08:23,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:08:23,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:08:23,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:08:23,960][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:08:23,962][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:08:24,744][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:08:25,203][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:08:25,711][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:08:26,211][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:08:26,711][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:08:27,217][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:08:27,718][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:08:28,221][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:08:28,736][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:08:29,243][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:08:29,760][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:08:30,272][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:08:30,777][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:08:31,291][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:08:31,796][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:08:32,304][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:08:32,811][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:08:33,317][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:08:33,818][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:08:34,316][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:08:34,814][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:08:35,316][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:08:35,822][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:08:36,328][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:08:36,834][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:08:37,342][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:08:37,868][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:08:38,373][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:08:38,878][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:08:39,386][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:08:39,895][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:08:40,409][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:08:40,916][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:08:41,420][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:08:41,927][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:08:42,433][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:08:42,941][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:08:43,446][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:08:43,952][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:08:44,465][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:08:44,974][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:08:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:08:45,992][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:08:46,502][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:08:47,010][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:08:47,517][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:08:48,025][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:08:48,542][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:08:49,048][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:08:49,551][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:08:50,060][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:08:50,565][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:08:51,076][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:08:51,583][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:08:52,087][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:08:52,601][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:08:53,108][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:08:53,614][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:08:54,119][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:08:54,628][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:08:55,133][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:08:55,643][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:08:56,148][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:08:56,658][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:08:57,164][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9982 tokens. [2025-11-13 07:08:58,068][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.10%, ΔTime: 00:00:33 [2025-11-13 07:08:58,696][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:08:58,698][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:08:58,700][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:08:59,626][__main__][INFO] - Iteration 569 took 1m 6s (45.14% Gen, 53.47% Train). Generation: 30s, Training: 35s. Estimated remaining time: 46h 35m 19s. Estimated total time: 55h 34m 37s. Time estimates for 10 more iterations: 11m 6s, 100 more iterations: 1h 51m 9s, 500 more iterations: 9h 15m 46s. [2025-11-13 07:08:59,628][__main__][INFO] - Starting iteration 569. [2025-11-13 07:09:00,103][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 07:09:00,103][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:09:19,496][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:09:28,055][__main__][INFO] - Number of regex retries in iteration 569: 1 [2025-11-13 07:09:28,056][__main__][INFO] - agents played in iteration 569 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:09:28,842][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:09:28,868][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:09:28,892][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:09:28,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:09:28,915][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:09:28,916][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:09:29,720][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:09:30,178][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:09:30,693][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:09:31,193][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:09:31,697][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:09:32,203][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:09:32,706][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:09:33,209][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:09:33,719][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:09:34,226][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:09:34,729][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:09:35,236][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:09:35,738][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:09:36,244][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:09:36,753][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:09:37,259][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:09:37,777][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:09:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:09:38,786][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:09:39,289][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:09:39,793][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:09:40,305][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:09:40,811][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:09:41,321][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:09:41,828][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:09:42,333][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:09:42,840][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:09:43,344][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:09:43,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:09:44,356][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:09:44,869][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:09:45,376][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:09:45,884][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:09:46,389][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:09:46,900][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:09:47,404][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:09:47,909][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:09:48,415][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:09:48,921][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:09:49,427][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:09:49,930][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:09:50,433][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:09:50,941][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:09:51,445][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:09:51,952][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:09:52,467][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:09:52,970][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:09:53,483][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:09:53,986][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:09:54,495][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:09:56,090][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:09:56,611][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:09:57,120][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:09:57,629][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:09:58,138][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:09:58,645][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:09:59,153][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:09:59,655][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:10:00,168][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:10:00,669][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:10:01,172][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:10:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:10:02,184][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:10:02,689][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:10:03,193][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10049 tokens. [2025-11-13 07:10:04,056][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:34 [2025-11-13 07:10:04,719][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:10:04,721][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:10:04,722][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:10:05,720][__main__][INFO] - Iteration 570 took 1m 5s (42.60% Gen, 55.88% Train). Generation: 27s, Training: 36s. Estimated remaining time: 45h 40m 32s. Estimated total time: 54h 40m 56s. Time estimates for 10 more iterations: 10m 56s, 100 more iterations: 1h 49m 21s, 500 more iterations: 9h 6m 49s. [2025-11-13 07:10:05,723][__main__][INFO] - Starting iteration 570. [2025-11-13 07:10:06,206][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 56 and human policies 1. 
[2025-11-13 07:10:06,207][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:10:38,171][__main__][INFO] - Number of regex retries in iteration 570: 0 [2025-11-13 07:10:38,172][__main__][INFO] - agents played in iteration 570 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:10:39,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:10:39,051][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:10:39,074][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:10:39,097][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:10:39,098][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:10:39,098][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:10:39,938][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:10:40,401][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:10:40,919][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:10:41,430][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:10:41,941][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:10:42,449][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:10:42,956][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:10:43,466][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:10:43,974][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:10:44,482][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:10:44,991][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:10:45,503][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:10:46,017][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:10:46,523][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:10:47,029][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:10:47,537][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:10:48,043][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:10:48,554][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:10:49,066][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:10:49,574][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:10:50,098][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:10:50,606][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:10:51,115][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:10:51,624][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:10:52,131][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:10:52,642][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:10:53,151][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:10:53,664][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:10:54,199][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:10:54,707][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:10:55,224][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:10:55,736][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:10:56,243][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:10:56,750][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:10:57,254][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:10:57,771][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:10:58,274][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:10:58,780][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:10:59,284][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:10:59,789][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:11:00,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:11:00,806][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:11:01,314][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:11:01,819][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:11:02,323][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:11:02,828][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:11:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:11:03,837][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:11:04,342][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:11:04,847][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:11:05,349][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:11:05,854][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:11:06,358][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:11:06,877][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:11:07,379][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:11:07,880][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:11:08,386][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:11:08,884][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:11:09,395][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:11:09,896][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:11:10,401][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:11:10,902][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:11:11,408][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:11:11,915][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:11:12,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10151 tokens. [2025-11-13 07:11:13,286][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 07:11:14,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:11:14,055][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:11:14,057][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:11:15,918][__main__][INFO] - Iteration 571 took 1m 9s (45.85% Gen, 51.48% Train). Generation: 31s, Training: 35s. Estimated remaining time: 49h 4m 1s. Estimated total time: 58h 5m 35s. Time estimates for 10 more iterations: 11m 37s, 100 more iterations: 1h 56m 11s, 500 more iterations: 9h 40m 55s. [2025-11-13 07:11:15,920][__main__][INFO] - Starting iteration 571. [2025-11-13 07:11:16,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 07:11:16,398][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:11:43,160][__main__][INFO] - Number of regex retries in iteration 571: 0 [2025-11-13 07:11:43,161][__main__][INFO] - agents played in iteration 571 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:11:43,932][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:11:43,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:11:43,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:11:44,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:11:44,009][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:11:44,010][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:11:44,790][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:11:45,250][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:11:45,759][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:11:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:11:46,771][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:11:47,281][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:11:47,790][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:11:48,296][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:11:48,801][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:11:49,307][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:11:49,817][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:11:50,321][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:11:50,824][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:11:51,329][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:11:51,837][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:11:52,344][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:11:52,849][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:11:53,357][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:11:53,879][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:11:54,386][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:11:54,891][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:11:55,404][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:11:55,913][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:11:56,426][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:11:56,931][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:11:57,435][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:11:57,942][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:11:58,451][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:11:58,956][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:11:59,472][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:11:59,981][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:12:00,500][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:12:01,008][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:12:01,519][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:12:02,030][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:12:02,538][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:12:03,045][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:12:04,246][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:12:05,402][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:12:05,912][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:12:06,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:12:06,925][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:12:07,431][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:12:07,935][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:12:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:12:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:12:09,454][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:12:09,962][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:12:10,466][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:12:10,971][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:12:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:12:11,978][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:12:12,498][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:12:13,003][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:12:13,509][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:12:14,017][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:12:14,522][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:12:15,029][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:12:15,532][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:12:16,034][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:12:16,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:12:17,044][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:12:17,547][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:12:18,055][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:12:18,557][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10045 tokens. [2025-11-13 07:12:19,445][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:34 [2025-11-13 07:12:20,092][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:12:20,094][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:12:20,096][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:12:20,945][__main__][INFO] - Iteration 572 took 1m 4s (41.46% Gen, 57.22% Train). Generation: 26s, Training: 36s. Estimated remaining time: 44h 44m 48s. Estimated total time: 53h 47m 27s. Time estimates for 10 more iterations: 10m 45s, 100 more iterations: 1h 47m 34s, 500 more iterations: 8h 57m 54s. [2025-11-13 07:12:20,948][__main__][INFO] - Starting iteration 572. [2025-11-13 07:12:21,455][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 07:12:21,456][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:12:49,079][__main__][INFO] - Number of regex retries in iteration 572: 0 [2025-11-13 07:12:49,080][__main__][INFO] - agents played in iteration 572 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:12:50,005][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:12:50,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:12:50,059][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:12:50,082][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:12:50,082][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:12:50,083][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:12:50,872][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:12:51,332][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:12:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:12:52,347][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:12:52,855][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:12:53,362][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:12:53,864][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:12:54,368][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:12:54,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:12:55,379][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:12:55,887][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:12:56,393][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:12:56,897][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:12:57,414][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:12:57,918][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:12:58,421][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:12:58,929][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:12:59,434][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:12:59,944][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:13:00,450][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:13:00,958][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:13:01,473][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:13:01,979][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:13:02,499][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:13:03,006][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:13:03,512][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:13:04,023][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:13:04,528][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:13:05,037][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:13:05,545][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:13:06,051][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:13:06,553][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:13:07,059][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:13:07,563][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:13:08,078][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:13:08,581][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:13:09,104][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:13:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:13:10,117][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:13:10,624][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:13:11,130][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:13:11,636][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:13:12,141][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:13:12,646][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:13:13,152][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:13:13,660][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:13:14,166][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:13:14,670][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:13:15,175][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:13:15,694][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:13:16,199][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:13:16,704][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:13:17,210][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:13:17,715][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:13:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:13:18,728][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:13:19,233][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:13:19,748][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:13:20,253][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:13:20,760][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:13:21,265][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:13:21,774][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:13:22,291][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:13:22,801][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:13:23,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9995 tokens. [2025-11-13 07:13:24,194][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:33 [2025-11-13 07:13:24,948][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:13:24,950][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:13:24,952][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:13:25,867][__main__][INFO] - Iteration 573 took 1m 4s (42.89% Gen, 55.69% Train). Generation: 27s, Training: 35s. Estimated remaining time: 44h 36m 54s. Estimated total time: 53h 40m 38s. Time estimates for 10 more iterations: 10m 44s, 100 more iterations: 1h 47m 21s, 500 more iterations: 8h 56m 46s. [2025-11-13 07:13:25,870][__main__][INFO] - Starting iteration 573. [2025-11-13 07:13:26,355][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 07:13:26,356][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:13:44,446][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:13:51,863][__main__][INFO] - Number of regex retries in iteration 573: 1 [2025-11-13 07:13:51,863][__main__][INFO] - agents played in iteration 573 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:13:52,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:13:52,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:13:52,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:13:52,761][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:13:52,762][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:13:52,763][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:13:53,520][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:13:53,980][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:13:54,486][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:13:54,986][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:13:55,492][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:13:55,992][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:13:56,508][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:13:57,009][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:13:57,509][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:13:58,016][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:13:58,516][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:13:59,023][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:13:59,526][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:14:00,028][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:14:00,534][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:14:01,033][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:14:01,535][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:14:02,043][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:14:02,546][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:14:03,051][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:14:03,559][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:14:04,065][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:14:04,595][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:14:05,108][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:14:05,627][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:14:06,134][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:14:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:14:07,153][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:14:07,662][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:14:08,168][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:14:08,675][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:14:09,184][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:14:09,691][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:14:10,203][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:14:10,710][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:14:11,218][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:14:11,726][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:14:12,231][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:14:12,737][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:14:13,244][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:14:13,750][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:14:14,260][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:14:14,773][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:14:15,280][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:14:15,789][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:14:16,308][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:14:16,818][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:14:17,334][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:14:17,841][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:14:18,346][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:14:18,853][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:14:20,686][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:14:21,229][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:14:21,734][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:14:22,244][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:14:22,751][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:14:23,258][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:14:23,758][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:14:24,265][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:14:24,776][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:14:25,281][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:14:25,787][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:14:26,286][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:14:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:14:27,305][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10124 tokens. [2025-11-13 07:14:28,169][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.43%, ΔTime: 00:00:34 [2025-11-13 07:14:28,831][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:14:28,837][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:14:28,840][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:14:29,723][__main__][INFO] - Iteration 574 took 1m 3s (40.25% Gen, 58.35% Train). Generation: 25s, Training: 36s. Estimated remaining time: 43h 43m 37s. Estimated total time: 52h 48m 25s. Time estimates for 10 more iterations: 10m 33s, 100 more iterations: 1h 45m 36s, 500 more iterations: 8h 48m 4s. [2025-11-13 07:14:29,727][__main__][INFO] - Starting iteration 574. [2025-11-13 07:14:30,213][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 07:14:30,214][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:15:03,118][__main__][INFO] - Number of regex retries in iteration 574: 0 [2025-11-13 07:15:03,118][__main__][INFO] - agents played in iteration 574 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:15:03,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:15:03,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:15:03,998][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:15:04,021][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:15:04,022][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:15:04,022][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:15:04,804][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:15:05,264][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:15:05,780][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:15:06,283][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:15:06,788][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:15:07,300][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:15:07,806][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:15:08,340][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:15:08,852][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:15:09,358][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:15:09,859][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:15:10,367][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:15:10,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:15:11,380][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:15:11,886][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:15:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:15:12,904][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:15:13,417][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:15:13,931][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:15:14,437][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:15:14,953][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:15:15,461][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:15:15,971][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:15:16,477][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:15:16,989][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:15:17,498][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:15:18,003][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:15:18,510][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:15:19,027][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:15:19,536][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:15:20,044][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:15:20,549][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:15:21,052][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:15:21,560][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:15:22,065][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:15:22,570][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:15:23,078][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:15:23,584][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:15:24,091][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:15:24,606][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:15:25,112][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:15:25,640][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:15:26,138][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:15:26,647][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:15:27,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:15:27,669][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:15:28,168][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:15:28,670][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:15:29,178][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:15:29,687][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:15:30,193][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:15:30,702][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:15:31,221][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:15:31,726][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:15:32,247][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:15:32,753][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:15:33,262][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:15:33,766][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:15:34,271][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:15:34,783][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:15:35,287][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:15:35,792][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:15:36,300][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:15:36,806][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:15:37,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10197 tokens. [2025-11-13 07:15:38,125][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 07:15:38,888][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:15:38,890][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:15:38,891][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:15:39,812][__main__][INFO] - Iteration 575 took 1m 9s (47.28% Gen, 51.40% Train). Generation: 32s, Training: 35s. Estimated remaining time: 48h 54m 0s. Estimated total time: 57h 59m 58s. Time estimates for 10 more iterations: 11m 35s, 100 more iterations: 1h 55m 59s, 500 more iterations: 9h 39m 59s. [2025-11-13 07:15:39,814][__main__][INFO] - Starting iteration 575. [2025-11-13 07:15:40,295][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 07:15:40,295][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:16:12,063][__main__][INFO] - Number of regex retries in iteration 575: 0 [2025-11-13 07:16:12,064][__main__][INFO] - agents played in iteration 575 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:16:12,908][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:16:12,935][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:16:12,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:16:12,986][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:16:12,987][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:16:12,987][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:16:13,738][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:16:14,194][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:16:14,703][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:16:15,203][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:16:15,706][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:16:16,207][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:16:16,717][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:16:17,222][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:16:17,726][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:16:18,230][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:16:18,746][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:16:19,246][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:16:19,766][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:16:20,269][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:16:20,774][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:16:21,288][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:16:21,791][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:16:22,306][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:16:22,815][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:16:23,320][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:16:23,826][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:16:24,331][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:16:24,840][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:16:25,351][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:16:25,856][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:16:26,372][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:16:26,879][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:16:27,387][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:16:27,896][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:16:28,405][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:16:28,916][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:16:29,423][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:16:29,930][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:16:30,459][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:16:30,964][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:16:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:16:31,974][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:16:32,482][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:16:32,991][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:16:33,502][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:16:34,012][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:16:34,519][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:16:35,028][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:16:35,540][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:16:36,051][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:16:36,557][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:16:37,064][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:16:37,574][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:16:38,081][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:16:38,583][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:16:39,086][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:16:39,595][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:16:40,100][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:16:40,603][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:16:41,106][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:16:41,608][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:16:42,121][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:16:42,629][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:16:43,134][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:16:43,642][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:16:44,147][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:16:44,650][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:16:45,159][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:16:45,665][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:16:46,172][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10138 tokens. [2025-11-13 07:16:47,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.05%, ΔTime: 00:00:33 [2025-11-13 07:16:47,800][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:16:47,802][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:16:47,804][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:16:48,795][__main__][INFO] - Iteration 576 took 1m 8s (46.37% Gen, 52.18% Train). Generation: 31s, Training: 35s. Estimated remaining time: 47h 57m 57s. Estimated total time: 57h 5m 4s. Time estimates for 10 more iterations: 11m 25s, 100 more iterations: 1h 54m 10s, 500 more iterations: 9h 30m 50s. [2025-11-13 07:16:48,797][__main__][INFO] - Starting iteration 576. [2025-11-13 07:16:49,268][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 07:16:49,268][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:16:57,968][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:17:09,097][__main__][INFO] - Number of regex retries in iteration 576: 1 [2025-11-13 07:17:09,098][__main__][INFO] - agents played in iteration 576 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:17:09,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:17:09,910][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:17:09,936][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:17:09,959][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:17:09,960][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:17:09,961][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:17:10,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:17:11,166][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:17:11,669][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:17:12,170][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:17:12,672][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:17:13,172][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:17:13,672][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:17:14,171][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:17:14,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:17:15,182][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:17:15,688][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:17:16,190][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:17:16,694][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:17:17,194][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:17:17,696][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:17:18,203][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:17:18,704][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:17:19,208][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:17:19,711][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:17:20,213][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:17:20,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:17:21,229][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:17:21,739][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:17:22,245][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:17:22,752][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:17:23,258][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:17:23,768][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:17:24,276][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:17:24,783][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:17:25,296][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:17:25,804][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:17:27,691][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:17:28,219][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:17:28,731][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:17:29,237][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:17:29,756][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:17:30,262][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:17:30,769][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:17:31,272][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:17:31,779][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:17:32,284][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:17:32,792][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:17:33,300][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:17:33,809][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:17:34,316][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:17:34,822][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:17:35,330][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:17:35,840][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:17:36,368][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:17:36,876][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:17:37,386][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:17:37,890][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:17:38,395][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:17:38,907][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:17:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:17:39,916][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:17:40,425][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:17:40,930][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:17:41,436][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:17:41,943][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:17:42,448][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:17:42,967][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:17:43,471][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:17:43,977][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:17:44,482][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10147 tokens. [2025-11-13 07:17:45,400][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:34 [2025-11-13 07:17:46,050][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:17:46,053][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:17:46,055][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:17:46,880][__main__][INFO] - Iteration 577 took 57s (34.42% Gen, 64.15% Train). Generation: 19s, Training: 36s. Estimated remaining time: 38h 52m 33s. Estimated total time: 48h 0m 38s. Time estimates for 10 more iterations: 9m 36s, 100 more iterations: 1h 36m 1s, 500 more iterations: 8h 0m 6s. [2025-11-13 07:17:46,882][__main__][INFO] - Starting iteration 577. [2025-11-13 07:17:47,368][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 07:17:47,369][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:18:20,971][__main__][INFO] - Number of regex retries in iteration 577: 0 [2025-11-13 07:18:20,972][__main__][INFO] - agents played in iteration 577 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:18:21,822][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:18:21,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:18:21,875][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:18:21,898][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:18:21,898][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:18:21,900][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:18:22,648][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:18:23,103][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:18:23,610][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:18:24,110][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:18:24,613][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:18:25,112][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:18:25,613][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:18:26,114][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:18:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:18:27,112][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:18:27,613][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:18:28,118][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:18:28,629][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:18:29,131][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:18:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:18:30,148][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:18:30,649][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:18:31,164][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:18:31,670][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:18:32,176][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:18:32,689][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:18:33,195][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:18:33,704][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:18:34,217][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:18:34,725][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:18:35,241][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:18:35,750][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:18:36,253][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:18:36,760][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:18:37,266][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:18:37,774][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:18:38,278][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:18:38,787][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:18:39,299][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:18:39,826][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:18:40,342][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:18:40,846][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:18:41,354][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:18:41,866][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:18:42,375][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:18:42,885][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:18:43,396][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:18:43,905][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:18:44,412][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:18:44,918][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:18:45,432][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:18:45,942][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:18:46,447][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:18:46,967][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:18:47,472][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:18:47,979][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:18:48,484][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:18:48,990][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:18:49,496][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:18:50,008][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:18:50,512][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:18:51,029][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:18:51,535][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:18:52,045][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:18:52,551][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:18:53,056][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:18:53,576][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:18:54,085][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:18:54,593][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:18:55,098][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10137 tokens. [2025-11-13 07:18:56,095][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.47%, ΔTime: 00:00:33 [2025-11-13 07:18:56,842][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:18:56,844][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:18:56,846][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:18:57,807][__main__][INFO] - Iteration 578 took 1m 10s (47.70% Gen, 50.93% Train). Generation: 33s, Training: 35s. Estimated remaining time: 49h 32m 44s. Estimated total time: 58h 42m 0s. Time estimates for 10 more iterations: 11m 44s, 100 more iterations: 1h 57m 24s, 500 more iterations: 9h 47m 0s. [2025-11-13 07:18:57,810][__main__][INFO] - Starting iteration 578. [2025-11-13 07:18:58,290][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 07:18:58,291][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:19:26,342][__main__][INFO] - Number of regex retries in iteration 578: 0 [2025-11-13 07:19:26,342][__main__][INFO] - agents played in iteration 578 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:19:27,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:19:27,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:19:27,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:19:27,246][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:19:27,246][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:19:27,247][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:19:28,015][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:19:28,476][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:19:28,988][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:19:29,492][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:19:30,001][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:19:30,505][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:19:31,008][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:19:31,510][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:19:32,017][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:19:32,523][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:19:33,026][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:19:33,529][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:19:34,033][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:19:34,546][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:19:35,050][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:19:35,568][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:19:36,079][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:19:36,583][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:19:37,092][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:19:37,594][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:19:38,097][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:19:38,605][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:19:39,110][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:19:39,619][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:19:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:19:40,630][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:19:41,135][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:19:41,641][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:19:42,161][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:19:42,668][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:19:43,175][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:19:43,680][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:19:44,186][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:19:44,696][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:19:45,205][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:19:45,710][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:19:46,217][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:19:46,723][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:19:47,228][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:19:47,732][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:19:48,236][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:19:48,750][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:19:49,256][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:19:49,765][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:19:50,275][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:19:50,784][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:19:51,292][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:19:51,797][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:19:52,302][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:19:52,813][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:19:53,322][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:19:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:19:54,333][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:19:54,837][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:19:55,350][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:19:55,858][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:19:56,365][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:19:56,874][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:19:57,381][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:19:57,894][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:19:58,403][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:19:58,911][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:19:59,421][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:20:00,727][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:20:01,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10090 tokens. [2025-11-13 07:20:02,603][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.31%, Current % of VRAM taken: 58.55%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:34 [2025-11-13 07:20:03,546][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:20:03,547][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:20:03,549][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:20:04,380][__main__][INFO] - Iteration 579 took 1m 6s (42.44% Gen, 56.30% Train). Generation: 28s, Training: 37s. Estimated remaining time: 45h 54m 10s. Estimated total time: 55h 4m 33s. Time estimates for 10 more iterations: 11m 0s, 100 more iterations: 1h 50m 9s, 500 more iterations: 9h 10m 45s. [2025-11-13 07:20:04,383][__main__][INFO] - Starting iteration 579. [2025-11-13 07:20:04,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 07:20:04,872][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:20:32,030][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:20:34,880][__main__][INFO] - Number of regex retries in iteration 579: 1 [2025-11-13 07:20:34,881][__main__][INFO] - agents played in iteration 579 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:20:35,834][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:20:35,856][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:20:35,878][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:20:35,900][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:20:35,901][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:20:35,902][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:20:36,717][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:20:37,190][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:20:37,702][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:20:38,213][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:20:38,719][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:20:39,223][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:20:39,735][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:20:40,239][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:20:40,745][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:20:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:20:41,764][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:20:42,268][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:20:42,770][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:20:43,273][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:20:43,794][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:20:44,298][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:20:44,802][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:20:45,314][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:20:45,822][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:20:46,325][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:20:46,833][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:20:47,343][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:20:47,851][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:20:48,361][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:20:48,865][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:20:49,375][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:20:49,884][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:20:50,407][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:20:50,916][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:20:51,423][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:20:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:20:52,435][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:20:52,942][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:20:53,451][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:20:53,960][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:20:54,485][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:20:54,995][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:20:55,512][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:20:56,025][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:20:56,534][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:20:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:20:57,548][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:20:58,057][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:20:58,560][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:20:59,063][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:20:59,585][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:21:00,089][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:21:00,597][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:21:01,106][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:21:01,609][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:21:02,120][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:21:02,625][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:21:03,131][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:21:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:21:04,140][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:21:04,646][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:21:05,150][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:21:05,656][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:21:06,174][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:21:06,678][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:21:07,192][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:21:07,694][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:21:08,201][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:21:08,711][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:21:09,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10122 tokens. [2025-11-13 07:21:10,037][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 07:21:10,803][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:21:10,805][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:21:10,806][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:21:11,760][__main__][INFO] - Iteration 580 took 1m 6s (44.86% Gen, 53.71% Train). Generation: 30s, Training: 35s. Estimated remaining time: 46h 32m 58s. Estimated total time: 55h 44m 28s. Time estimates for 10 more iterations: 11m 8s, 100 more iterations: 1h 51m 28s, 500 more iterations: 9h 17m 24s. [2025-11-13 07:21:11,762][__main__][INFO] - Starting iteration 580. [2025-11-13 07:21:12,242][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 57 and human policies 1. 
[2025-11-13 07:21:12,243][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:21:39,072][__main__][INFO] - Number of regex retries in iteration 580: 0 [2025-11-13 07:21:39,072][__main__][INFO] - agents played in iteration 580 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:21:39,848][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:21:39,876][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:21:39,903][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:21:39,927][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:21:39,928][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:21:39,929][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:21:40,692][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:21:41,154][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:21:41,667][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:21:42,174][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:21:42,679][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:21:43,199][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:21:43,717][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:21:44,224][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:21:44,729][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:21:45,240][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:21:45,762][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:21:46,271][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:21:46,772][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:21:47,280][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:21:47,780][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:21:48,287][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:21:48,797][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:21:49,304][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:21:49,811][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:21:50,317][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:21:50,836][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:21:51,347][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:21:51,854][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:21:52,363][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:21:52,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:21:53,396][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:21:53,906][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:21:54,417][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:21:54,925][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:21:55,434][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:21:55,948][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:21:56,455][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:21:56,960][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:21:57,467][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:21:57,975][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:21:58,479][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:21:58,985][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:21:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:22:00,008][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:22:00,517][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:22:01,027][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:22:01,532][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:22:02,043][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:22:02,553][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:22:03,055][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:22:03,556][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:22:04,060][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:22:04,568][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:22:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:22:05,580][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:22:06,087][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:22:06,612][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:22:07,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:22:07,620][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:22:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:22:08,633][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:22:09,141][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:22:09,639][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:22:10,134][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:22:10,633][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:22:11,127][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:22:11,628][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:22:12,139][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:22:12,634][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:22:13,130][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9945 tokens. [2025-11-13 07:22:13,933][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.92%, Current % of VRAM taken: 58.17%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33 [2025-11-13 07:22:14,700][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:22:14,702][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:22:14,704][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:22:16,532][__main__][INFO] - Iteration 581 took 1m 4s (41.73% Gen, 55.42% Train). Generation: 26s, Training: 35s. Estimated remaining time: 44h 21m 55s. Estimated total time: 53h 34m 30s. Time estimates for 10 more iterations: 10m 42s, 100 more iterations: 1h 47m 9s, 500 more iterations: 8h 55m 45s. [2025-11-13 07:22:16,534][__main__][INFO] - Starting iteration 581. [2025-11-13 07:22:17,011][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 07:22:17,012][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:22:30,009][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:22:40,358][__main__][INFO] - Number of regex retries in iteration 581: 1 [2025-11-13 07:22:40,358][__main__][INFO] - agents played in iteration 581 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:22:41,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:22:41,148][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:22:41,171][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:22:41,193][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:22:41,193][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:22:41,195][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:22:42,008][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:22:42,468][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:22:42,982][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:22:43,488][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:22:43,994][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:22:44,497][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:22:45,005][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:22:45,518][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:22:46,037][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:22:46,541][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:22:47,047][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:22:47,556][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:22:48,061][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:22:48,568][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:22:49,073][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:22:49,577][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:22:50,082][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:22:50,586][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:22:51,091][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:22:51,595][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:22:52,098][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:22:52,602][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:22:53,106][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:22:53,616][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:22:54,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:22:54,640][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:22:55,148][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:22:55,655][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:22:56,163][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:22:56,672][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:22:57,186][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:22:57,692][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:22:58,210][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:22:59,123][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:23:00,666][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:23:01,180][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:23:01,686][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:23:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:23:02,711][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:23:03,229][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:23:03,742][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:23:04,244][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:23:04,750][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:23:05,251][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:23:05,756][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:23:06,256][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:23:06,756][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:23:07,258][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:23:07,761][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:23:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:23:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:23:09,284][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:23:09,790][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:23:10,296][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:23:10,803][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:23:11,312][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:23:11,821][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:23:12,324][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:23:12,829][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:23:13,337][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:23:13,845][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:23:14,350][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:23:14,852][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:23:15,353][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:23:15,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10053 tokens. [2025-11-13 07:23:16,743][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.04%, Current % of VRAM taken: 58.29%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:34 [2025-11-13 07:23:17,407][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:23:17,409][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:23:17,411][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:23:18,232][__main__][INFO] - Iteration 582 took 1m 1s (38.13% Gen, 60.52% Train). Generation: 23s, Training: 37s. Estimated remaining time: 41h 47m 29s. Estimated total time: 51h 1m 6s. Time estimates for 10 more iterations: 10m 12s, 100 more iterations: 1h 42m 2s, 500 more iterations: 8h 30m 11s. [2025-11-13 07:23:18,235][__main__][INFO] - Starting iteration 582. [2025-11-13 07:23:18,726][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 07:23:18,726][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:23:50,947][__main__][INFO] - Number of regex retries in iteration 582: 0 [2025-11-13 07:23:50,948][__main__][INFO] - agents played in iteration 582 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:23:51,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:23:51,777][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:23:51,803][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:23:51,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:23:51,827][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:23:51,829][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:23:52,700][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:23:53,183][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:23:53,697][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:23:54,211][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:23:54,727][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:23:55,240][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:23:55,760][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:23:56,268][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:23:56,775][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:23:57,283][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:23:57,789][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:23:58,311][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:23:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:23:59,318][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:23:59,824][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:24:00,330][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:24:00,836][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:24:01,342][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:24:01,853][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:24:02,364][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:24:02,871][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:24:03,378][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:24:03,888][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:24:04,399][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:24:04,912][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:24:05,419][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:24:05,929][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:24:06,435][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:24:06,944][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:24:07,449][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:24:07,958][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:24:08,470][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:24:08,979][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:24:09,484][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:24:09,987][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:24:10,493][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:24:11,001][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:24:11,499][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:24:12,003][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:24:12,502][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:24:13,005][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:24:13,510][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:24:14,014][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:24:14,517][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:24:15,022][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:24:15,542][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:24:16,048][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:24:16,557][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:24:17,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:24:17,569][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:24:18,077][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:24:18,587][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:24:19,092][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:24:19,599][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:24:20,105][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:24:20,609][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:24:21,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:24:21,617][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:24:22,130][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:24:22,636][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:24:23,141][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:24:23,641][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:24:24,145][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:24:24,644][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:24:25,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9953 tokens. [2025-11-13 07:24:25,935][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.95%, Current % of VRAM taken: 58.20%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 07:24:26,706][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:24:26,707][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:24:26,709][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:24:27,693][__main__][INFO] - Iteration 583 took 1m 8s (46.72% Gen, 51.85% Train). Generation: 32s, Training: 35s. Estimated remaining time: 48h 13m 36s. Estimated total time: 57h 28m 22s. Time estimates for 10 more iterations: 11m 29s, 100 more iterations: 1h 54m 56s, 500 more iterations: 9h 34m 43s. [2025-11-13 07:24:27,695][__main__][INFO] - Starting iteration 583. [2025-11-13 07:24:28,183][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 07:24:28,184][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:24:53,348][__main__][INFO] - Number of regex retries in iteration 583: 0 [2025-11-13 07:24:53,349][__main__][INFO] - agents played in iteration 583 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:24:54,132][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:24:54,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:24:54,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:24:54,208][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:24:54,209][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:24:54,210][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:24:55,003][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:24:55,468][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:24:56,058][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:24:56,567][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:24:57,067][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:24:57,575][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:24:58,082][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:24:58,591][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:24:59,098][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:24:59,607][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:25:00,114][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:25:00,624][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:25:01,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:25:01,638][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:25:02,145][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:25:02,654][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:25:03,158][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:25:03,663][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:25:04,172][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:25:04,678][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:25:05,184][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:25:05,690][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:25:06,199][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:25:06,707][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:25:07,210][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:25:07,709][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:25:08,223][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:25:08,722][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:25:09,224][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:25:09,728][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:25:10,222][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:25:10,733][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:25:11,239][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:25:11,743][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:25:12,245][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:25:12,747][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:25:13,258][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:25:13,764][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:25:14,265][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:25:14,766][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:25:15,268][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:25:15,776][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:25:16,281][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:25:16,784][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:25:17,290][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:25:17,796][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:25:18,306][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:25:18,812][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:25:19,311][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:25:19,807][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:25:20,307][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:25:20,808][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:25:21,327][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:25:21,833][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:25:22,346][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:25:22,848][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:25:23,347][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:25:23,858][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:25:24,357][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:25:24,860][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:25:25,361][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:25:25,860][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:25:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:25:26,867][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:25:27,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9994 tokens. [2025-11-13 07:25:28,146][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 07:25:28,931][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:25:28,934][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:25:28,936][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:25:29,841][__main__][INFO] - Iteration 584 took 1m 1s (40.81% Gen, 57.72% Train). Generation: 25s, Training: 35s. Estimated remaining time: 42h 7m 6s. Estimated total time: 51h 22m 54s. Time estimates for 10 more iterations: 10m 16s, 100 more iterations: 1h 42m 45s, 500 more iterations: 8h 33m 49s. [2025-11-13 07:25:29,843][__main__][INFO] - Starting iteration 584. [2025-11-13 07:25:30,314][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 07:25:30,314][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:25:43,331][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:25:50,460][__main__][INFO] - Number of regex retries in iteration 584: 1 [2025-11-13 07:25:50,460][__main__][INFO] - agents played in iteration 584 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:25:51,253][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:25:51,276][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:25:51,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:25:51,320][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:25:51,321][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:25:51,322][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:25:52,145][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:25:52,604][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:25:53,117][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:25:53,619][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:25:54,135][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:25:54,643][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:25:55,148][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:25:55,655][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:25:56,163][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:25:56,664][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:25:57,170][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:25:57,674][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:25:58,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:25:58,691][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:25:59,196][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:25:59,703][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:26:00,211][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:26:00,717][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:26:01,222][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:26:01,728][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:26:02,236][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:26:02,742][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:26:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:26:03,753][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:26:04,260][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:26:04,778][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:26:05,276][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:26:05,785][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:26:06,281][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:26:06,774][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:26:07,279][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:26:07,773][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:26:08,268][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:26:08,783][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:26:09,289][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:26:09,793][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:26:10,288][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:26:10,783][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:26:11,283][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:26:11,780][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:26:12,285][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:26:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:26:13,291][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:26:13,793][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:26:14,302][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:26:14,801][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:26:15,300][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:26:15,799][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:26:16,302][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:26:16,805][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:26:17,308][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:26:17,817][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:26:18,321][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:26:18,824][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:26:19,332][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:26:19,836][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:26:20,335][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:26:20,835][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:26:21,330][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:26:21,824][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:26:22,326][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:26:22,827][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:26:23,341][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:26:23,840][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:26:24,338][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9916 tokens. [2025-11-13 07:26:25,121][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.90%, Current % of VRAM taken: 58.15%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:32 [2025-11-13 07:26:25,911][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:26:25,913][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:26:25,915][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:26:26,862][__main__][INFO] - Iteration 585 took 56s (35.63% Gen, 62.70% Train). Generation: 20s, Training: 35s. Estimated remaining time: 37h 50m 40s. Estimated total time: 47h 7m 25s. Time estimates for 10 more iterations: 9m 25s, 100 more iterations: 1h 34m 14s, 500 more iterations: 7h 51m 14s. [2025-11-13 07:26:26,864][__main__][INFO] - Starting iteration 585. [2025-11-13 07:26:27,333][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 07:26:27,333][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:26:52,713][__main__][INFO] - Number of regex retries in iteration 585: 0 [2025-11-13 07:26:52,714][__main__][INFO] - agents played in iteration 585 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:26:53,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:26:53,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:26:53,652][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:26:53,678][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:26:53,679][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:26:53,680][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:26:54,569][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:26:55,032][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:26:55,542][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:26:56,060][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:26:56,564][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:26:57,075][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:26:57,580][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:26:58,086][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:26:58,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:26:59,100][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:26:59,607][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:27:00,117][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:27:00,623][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:27:01,130][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:27:01,635][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:27:02,148][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:27:02,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:27:03,165][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:27:03,679][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:27:04,181][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:27:04,689][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:27:05,196][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:27:05,701][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:27:06,202][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:27:06,702][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:27:07,204][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:27:07,711][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:27:08,214][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:27:08,716][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:27:09,215][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:27:09,717][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:27:10,216][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:27:10,718][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:27:11,223][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:27:11,733][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:27:12,236][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:27:12,739][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:27:13,260][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:27:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:27:14,269][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:27:14,770][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:27:15,272][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:27:15,771][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:27:16,274][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:27:16,790][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:27:17,294][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:27:17,790][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:27:18,284][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:27:18,783][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:27:19,291][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:27:19,793][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:27:20,296][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:27:20,803][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:27:21,311][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:27:21,814][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:27:22,319][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:27:22,823][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:27:23,330][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:27:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:27:24,334][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:27:24,846][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:27:25,350][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:27:25,856][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:27:26,359][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:27:26,852][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9881 tokens. [2025-11-13 07:27:27,632][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.95%, Current % of VRAM taken: 58.19%, Block Peak % of device VRAM: 62.15%, ΔTime: 00:00:33 [2025-11-13 07:27:28,383][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:27:28,385][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:27:28,386][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:27:29,298][__main__][INFO] - Iteration 586 took 1m 1s (40.96% Gen, 57.57% Train). Generation: 25s, Training: 35s. Estimated remaining time: 42h 20m 28s. Estimated total time: 51h 38m 16s. Time estimates for 10 more iterations: 10m 19s, 100 more iterations: 1h 43m 16s, 500 more iterations: 8h 36m 22s. [2025-11-13 07:27:29,300][__main__][INFO] - Starting iteration 586. [2025-11-13 07:27:29,811][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 07:27:29,812][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:27:49,265][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:28:01,361][__main__][INFO] - Number of regex retries in iteration 586: 1 [2025-11-13 07:28:01,361][__main__][INFO] - agents played in iteration 586 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:28:02,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:28:02,243][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:28:02,267][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:28:02,290][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:28:02,290][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:28:02,291][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:28:03,167][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:28:03,625][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:28:04,141][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:28:04,646][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:28:05,151][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:28:05,662][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:28:06,168][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:28:06,683][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:28:07,189][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:28:07,700][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:28:08,208][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:28:08,714][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:28:09,219][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:28:09,724][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:28:10,230][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:28:10,736][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:28:11,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:28:11,756][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:28:12,268][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:28:12,769][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:28:13,282][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:28:13,789][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:28:14,297][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:28:14,806][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:28:15,311][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:28:15,819][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:28:16,320][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:28:16,826][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:28:17,327][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:28:17,829][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:28:18,327][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:28:18,830][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:28:19,329][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:28:19,830][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:28:20,333][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:28:20,835][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:28:21,340][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:28:21,844][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:28:22,351][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:28:22,862][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:28:23,365][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:28:23,885][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:28:24,387][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:28:24,888][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:28:25,398][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:28:25,900][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:28:26,405][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:28:26,907][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:28:27,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:28:27,926][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:28:28,431][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:28:28,936][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:28:29,445][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:28:29,945][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:28:30,447][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:28:30,947][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:28:31,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:28:31,948][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:28:32,450][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:28:32,951][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:28:33,446][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:28:33,970][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:28:34,485][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:28:34,987][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:28:35,489][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9980 tokens. [2025-11-13 07:28:36,275][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 07:28:37,067][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:28:37,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:28:37,070][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:28:37,988][__main__][INFO] - Iteration 587 took 1m 8s (46.28% Gen, 52.38% Train). Generation: 31s, Training: 35s. Estimated remaining time: 47h 29m 54s. Estimated total time: 56h 48m 50s. Time estimates for 10 more iterations: 11m 21s, 100 more iterations: 1h 53m 37s, 500 more iterations: 9h 28m 8s. [2025-11-13 07:28:37,990][__main__][INFO] - Starting iteration 587. [2025-11-13 07:28:38,474][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 07:28:38,474][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:28:53,015][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:28:54,013][mllm.models.large_language_model_local][WARNING] - Response Proposal: 20 hats, 20 books, 20 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:29:05,231][__main__][INFO] - Number of regex retries in iteration 587: 2 [2025-11-13 07:29:05,232][__main__][INFO] - agents played in iteration 587 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:29:06,099][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:29:06,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:29:06,149][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:29:06,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:29:06,173][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:29:06,174][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:29:07,039][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:29:07,515][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:29:08,033][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:29:08,541][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:29:09,048][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:29:10,154][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:29:11,354][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:29:11,868][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:29:12,401][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:29:12,908][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:29:13,416][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:29:13,924][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:29:14,423][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:29:14,923][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:29:15,430][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:29:15,931][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:29:16,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:29:16,938][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:29:17,435][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:29:17,941][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:29:18,443][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:29:18,948][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:29:19,457][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:29:19,959][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:29:20,465][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:29:20,971][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:29:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:29:21,989][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:29:22,492][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:29:22,998][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:29:23,502][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:29:24,008][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:29:24,518][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:29:25,025][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:29:25,532][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:29:26,037][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:29:26,542][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:29:27,048][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:29:27,558][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:29:28,065][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:29:28,569][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:29:29,069][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:29:29,569][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:29:30,066][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:29:30,568][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:29:31,068][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:29:31,567][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:29:32,066][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:29:32,565][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:29:33,072][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:29:33,571][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:29:34,071][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:29:34,572][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:29:35,068][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:29:35,571][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:29:36,074][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:29:36,574][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:29:37,072][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:29:37,566][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:29:38,067][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:29:38,568][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:29:39,070][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:29:39,569][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:29:40,069][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:29:40,569][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10105 tokens. [2025-11-13 07:29:41,393][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:34 [2025-11-13 07:29:42,054][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:29:42,056][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:29:42,058][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:29:42,886][__main__][INFO] - Iteration 588 took 1m 4s (41.54% Gen, 57.17% Train). Generation: 26s, Training: 36s. Estimated remaining time: 44h 20m 36s. Estimated total time: 53h 40m 37s. Time estimates for 10 more iterations: 10m 44s, 100 more iterations: 1h 47m 21s, 500 more iterations: 8h 56m 46s. [2025-11-13 07:29:42,888][__main__][INFO] - Starting iteration 588. [2025-11-13 07:29:43,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 07:29:43,407][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:30:22,301][__main__][INFO] - Number of regex retries in iteration 588: 0 [2025-11-13 07:30:22,302][__main__][INFO] - agents played in iteration 588 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:30:23,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:30:23,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:30:23,211][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:30:23,235][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:30:23,235][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:30:23,236][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:30:24,038][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:30:24,494][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:30:25,027][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:30:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:30:26,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:30:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:30:27,036][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:30:27,540][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:30:28,046][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:30:28,550][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:30:29,053][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:30:29,555][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:30:30,057][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:30:30,558][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:30:31,064][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:30:31,570][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:30:32,073][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:30:32,577][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:30:33,081][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:30:33,580][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:30:34,082][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:30:34,583][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:30:35,083][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:30:35,583][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:30:36,083][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:30:36,584][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:30:37,082][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:30:37,580][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:30:38,081][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:30:38,579][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:30:39,082][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:30:39,574][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:30:40,065][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:30:40,565][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:30:41,065][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:30:41,568][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:30:42,069][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:30:42,569][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:30:43,070][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:30:43,569][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:30:44,069][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:30:44,572][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:30:45,076][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:30:45,578][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:30:46,083][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:30:46,583][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:30:47,097][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:30:47,596][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:30:48,095][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:30:48,604][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:30:49,104][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:30:49,616][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:30:50,115][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:30:50,616][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:30:51,129][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:30:51,628][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:30:52,127][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:30:52,623][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:30:53,128][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:30:53,636][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:30:54,136][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:30:54,630][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:30:55,130][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:30:55,630][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:30:56,136][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10030 tokens. [2025-11-13 07:30:56,918][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:32 [2025-11-13 07:30:57,697][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:30:57,699][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:30:57,701][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:30:58,619][__main__][INFO] - Iteration 589 took 1m 15s (51.71% Gen, 47.06% Train). Generation: 38s, Training: 35s. Estimated remaining time: 53h 19m 23s. Estimated total time: 62h 40m 40s. Time estimates for 10 more iterations: 12m 32s, 100 more iterations: 2h 5m 21s, 500 more iterations: 10h 26m 46s. [2025-11-13 07:30:58,621][__main__][INFO] - Starting iteration 589. [2025-11-13 07:30:59,114][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 07:30:59,115][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:31:25,177][__main__][INFO] - Number of regex retries in iteration 589: 0 [2025-11-13 07:31:25,177][__main__][INFO] - agents played in iteration 589 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:31:25,972][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:31:26,002][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:31:26,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:31:26,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:31:26,053][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:31:26,054][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:31:26,845][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:31:27,306][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:31:27,825][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:31:28,332][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:31:28,833][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:31:29,336][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:31:29,842][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:31:30,346][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:31:30,849][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:31:31,349][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:31:31,857][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:31:32,360][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:31:32,865][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:31:33,367][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:31:33,868][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:31:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:31:34,864][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:31:35,362][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:31:35,860][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:31:36,353][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:31:36,852][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:31:37,352][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:31:37,849][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:31:38,350][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:31:38,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:31:39,347][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:31:39,843][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:31:40,336][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:31:40,841][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:31:41,346][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:31:41,846][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:31:42,348][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:31:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:31:43,347][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:31:43,850][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:31:44,353][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:31:44,859][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:31:45,360][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:31:45,854][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:31:46,356][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:31:46,859][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:31:47,364][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:31:47,865][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:31:48,366][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:31:48,869][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:31:49,372][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:31:49,873][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:31:50,375][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:31:50,876][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:31:51,376][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:31:51,873][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:31:52,371][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:31:52,869][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:31:53,367][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:31:53,867][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:31:54,364][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:31:54,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:31:55,367][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:31:55,868][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:31:56,368][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:31:56,868][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:31:57,370][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:31:57,871][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:31:58,372][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:31:58,872][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9945 tokens. [2025-11-13 07:31:59,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:32 [2025-11-13 07:32:00,426][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:32:00,428][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:32:00,429][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:32:01,364][__main__][INFO] - Iteration 590 took 1m 2s (41.87% Gen, 56.63% Train). Generation: 26s, Training: 35s. Estimated remaining time: 42h 30m 13s. Estimated total time: 51h 52m 33s. Time estimates for 10 more iterations: 10m 22s, 100 more iterations: 1h 43m 45s, 500 more iterations: 8h 38m 45s. [2025-11-13 07:32:01,366][__main__][INFO] - Starting iteration 590. [2025-11-13 07:32:01,852][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 58 and human policies 1. 
[2025-11-13 07:32:01,852][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:32:25,388][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:32:28,235][__main__][INFO] - Number of regex retries in iteration 590: 1 [2025-11-13 07:32:28,236][__main__][INFO] - agents played in iteration 590 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:32:29,146][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:32:29,205][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:32:29,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:32:29,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:32:29,281][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:32:29,282][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:32:30,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:32:30,637][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:32:31,161][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:32:31,668][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:32:32,170][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:32:32,669][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:32:33,169][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:32:33,672][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:32:34,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:32:34,673][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:32:35,181][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:32:35,687][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:32:36,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:32:36,701][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:32:37,208][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:32:37,717][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:32:38,216][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:32:38,715][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:32:39,234][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:32:39,732][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:32:40,240][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:32:40,745][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:32:41,242][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:32:41,751][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:32:42,254][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:32:42,779][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:32:43,272][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:32:43,771][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:32:44,275][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:32:44,778][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:32:45,274][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:32:45,773][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:32:46,277][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:32:46,770][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:32:47,271][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:32:47,765][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:32:48,264][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:32:48,758][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:32:49,260][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:32:49,762][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:32:50,262][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:32:50,761][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:32:51,256][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:32:51,748][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:32:52,247][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:32:52,742][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:32:53,233][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:32:53,747][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:32:54,239][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:32:54,734][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:32:55,236][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:32:55,735][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:32:56,251][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:32:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:32:57,246][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:32:57,750][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:32:58,251][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:32:58,745][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:32:59,236][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:32:59,733][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:33:00,245][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:33:00,743][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:33:01,241][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:33:01,743][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:33:02,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9827 tokens. [2025-11-13 07:33:03,062][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.99%, Current % of VRAM taken: 58.23%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:32 [2025-11-13 07:33:03,705][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:33:03,706][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:33:03,708][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:33:05,379][__main__][INFO] - Iteration 591 took 1m 3s (41.53% Gen, 55.84% Train). Generation: 26s, Training: 35s. Estimated remaining time: 43h 33m 2s. Estimated total time: 52h 56m 26s. Time estimates for 10 more iterations: 10m 35s, 100 more iterations: 1h 45m 52s, 500 more iterations: 8h 49m 24s. [2025-11-13 07:33:05,381][__main__][INFO] - Starting iteration 591. [2025-11-13 07:33:05,871][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 07:33:05,873][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:33:22,420][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:33:33,877][__main__][INFO] - Number of regex retries in iteration 591: 1 [2025-11-13 07:33:33,878][__main__][INFO] - agents played in iteration 591 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:33:34,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:33:34,726][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:33:34,748][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:33:34,771][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:33:34,771][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:33:34,772][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:33:35,594][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:33:36,064][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:33:36,571][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:33:37,079][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:33:37,582][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:33:38,090][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:33:38,595][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:33:39,102][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:33:39,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:33:40,114][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:33:40,611][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:33:41,114][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:33:41,613][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:33:42,113][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:33:42,619][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:33:43,117][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:33:43,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:33:44,115][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:33:44,617][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:33:45,127][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:33:45,633][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:33:46,145][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:33:46,646][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:33:47,147][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:33:47,654][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:33:48,155][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:33:48,660][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:33:49,159][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:33:49,664][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:33:50,183][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:33:50,687][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:33:51,188][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:33:51,693][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:33:52,195][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:33:52,694][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:33:53,194][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:33:53,689][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:33:54,187][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:33:54,684][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:33:55,185][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:33:55,689][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:33:56,190][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:33:56,694][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:33:57,197][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:33:57,699][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:33:58,203][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:33:58,704][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:33:59,208][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:33:59,710][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:34:00,211][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:34:00,714][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:34:01,215][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:34:01,716][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:34:02,218][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:34:02,719][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:34:03,222][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:34:03,725][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:34:04,225][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:34:04,727][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:34:05,228][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:34:05,727][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:34:06,230][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:34:06,730][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:34:07,241][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:34:07,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9996 tokens. [2025-11-13 07:34:08,547][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.08%, ΔTime: 00:00:32 [2025-11-13 07:34:09,291][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:34:09,292][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:34:09,294][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:34:10,205][__main__][INFO] - Iteration 592 took 1m 4s (43.53% Gen, 55.05% Train). Generation: 28s, Training: 35s. Estimated remaining time: 44h 12m 15s. Estimated total time: 53h 36m 44s. Time estimates for 10 more iterations: 10m 43s, 100 more iterations: 1h 47m 13s, 500 more iterations: 8h 56m 7s. [2025-11-13 07:34:10,207][__main__][INFO] - Starting iteration 592. [2025-11-13 07:34:10,694][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 07:34:10,694][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:34:35,585][__main__][INFO] - Number of regex retries in iteration 592: 0 [2025-11-13 07:34:35,586][__main__][INFO] - agents played in iteration 592 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:34:36,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:34:36,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:34:36,443][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:34:36,466][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:34:36,466][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:34:36,467][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:34:37,295][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:34:37,770][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:34:38,284][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:34:38,789][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:34:39,296][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:34:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:34:40,320][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:34:40,827][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:34:41,334][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:34:41,834][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:34:42,331][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:34:42,829][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:34:43,330][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:34:43,829][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:34:44,323][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:34:44,822][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:34:45,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:34:45,825][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:34:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:34:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:34:47,324][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:34:47,825][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:34:48,325][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:34:48,828][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:34:49,326][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:34:49,826][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:34:50,327][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:34:50,826][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:34:51,328][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:34:51,826][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:34:52,327][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:34:52,826][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:34:53,325][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:34:53,824][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:34:54,327][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:34:54,826][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:34:55,327][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:34:55,825][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:34:56,328][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:34:56,825][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:34:57,324][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:34:57,823][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:34:58,326][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:34:58,823][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:34:59,317][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:34:59,819][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:35:00,313][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:35:00,811][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:35:01,305][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:35:01,797][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:35:02,293][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:35:02,803][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:35:03,297][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:35:03,793][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:35:04,295][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:35:04,796][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:35:05,289][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:35:05,788][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:35:06,291][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:35:06,798][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:35:07,298][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:35:07,800][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:35:08,297][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:35:08,796][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:35:09,299][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10017 tokens. [2025-11-13 07:35:10,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:32 [2025-11-13 07:35:10,828][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:35:10,829][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:35:10,831][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:35:11,718][__main__][INFO] - Iteration 593 took 1m 1s (40.79% Gen, 57.76% Train). Generation: 24s, Training: 35s. Estimated remaining time: 41h 25m 43s. Estimated total time: 50h 51m 13s. Time estimates for 10 more iterations: 10m 10s, 100 more iterations: 1h 41m 42s, 500 more iterations: 8h 28m 32s. [2025-11-13 07:35:11,720][__main__][INFO] - Starting iteration 593. [2025-11-13 07:35:12,194][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 07:35:12,195][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:35:42,394][__main__][INFO] - Number of regex retries in iteration 593: 0 [2025-11-13 07:35:42,396][__main__][INFO] - agents played in iteration 593 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:35:43,280][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:35:43,335][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:35:43,375][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:35:43,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:35:43,406][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:35:43,407][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:35:44,257][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:35:44,726][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:35:45,234][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:35:45,739][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:35:46,244][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:35:46,748][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:35:47,249][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:35:47,752][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:35:48,253][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:35:48,756][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:35:49,265][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:35:49,767][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:35:50,270][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:35:50,773][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:35:51,282][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:35:51,788][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:35:52,288][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:35:52,792][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:35:53,295][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:35:53,794][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:35:54,301][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:35:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:35:55,302][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:35:55,803][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:35:56,304][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:35:56,799][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:35:57,303][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:35:57,816][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:35:58,309][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:35:58,808][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:35:59,303][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:35:59,794][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:36:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:36:00,804][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:36:01,303][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:36:01,818][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:36:02,318][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:36:02,815][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:36:03,313][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:36:03,816][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:36:04,324][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:36:04,824][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:36:05,324][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:36:05,828][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:36:06,326][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:36:06,828][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:36:07,327][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:36:07,826][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:36:08,325][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:36:08,823][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:36:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:36:09,818][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:36:10,314][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:36:10,822][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:36:11,320][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:36:11,820][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:36:12,322][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:36:12,826][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:36:13,329][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:36:13,828][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:36:14,327][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:36:14,830][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:36:15,328][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:36:15,827][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:36:16,326][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9993 tokens. [2025-11-13 07:36:17,141][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:32 [2025-11-13 07:36:17,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:36:17,793][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:36:17,795][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:36:18,716][__main__][INFO] - Iteration 594 took 1m 6s (45.40% Gen, 53.21% Train). Generation: 30s, Training: 35s. Estimated remaining time: 45h 59m 28s. Estimated total time: 55h 26m 5s. Time estimates for 10 more iterations: 11m 5s, 100 more iterations: 1h 50m 52s, 500 more iterations: 9h 14m 20s. [2025-11-13 07:36:18,718][__main__][INFO] - Starting iteration 594. [2025-11-13 07:36:19,189][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 07:36:19,190][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:36:43,598][__main__][INFO] - Number of regex retries in iteration 594: 0 [2025-11-13 07:36:43,599][__main__][INFO] - agents played in iteration 594 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:36:44,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:36:44,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:36:44,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:36:44,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:36:44,551][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:36:44,553][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:36:45,379][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:36:45,841][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:36:46,356][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:36:46,861][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:36:47,360][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:36:47,866][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:36:48,366][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:36:48,877][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:36:49,376][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:36:49,885][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:36:50,387][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:36:50,889][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:36:51,394][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:36:51,902][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:36:52,407][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:36:52,913][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:36:53,417][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:36:53,930][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:36:54,433][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:36:54,938][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:36:55,443][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:36:55,945][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:36:56,448][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:36:56,950][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:36:57,458][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:36:57,961][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:36:58,462][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:36:58,963][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:36:59,462][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:36:59,963][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:37:00,461][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:37:00,962][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:37:01,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:37:01,965][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:37:02,467][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:37:02,968][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:37:03,491][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:37:03,987][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:37:04,514][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:37:05,017][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:37:05,513][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:37:06,018][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:37:06,512][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:37:07,019][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:37:07,518][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:37:08,021][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:37:08,524][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:37:09,026][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:37:09,526][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:37:10,025][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:37:10,528][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:37:11,035][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:37:11,539][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:37:12,030][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:37:12,528][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:37:13,028][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:37:13,528][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:37:14,027][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:37:14,527][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:37:15,032][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:37:15,532][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:37:16,032][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:37:16,529][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:37:17,030][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:37:17,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9886 tokens. [2025-11-13 07:37:18,369][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:32 [2025-11-13 07:37:19,134][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:37:19,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:37:19,137][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:37:20,048][__main__][INFO] - Iteration 595 took 1m 0s (40.11% Gen, 58.39% Train). Generation: 24s, Training: 35s. Estimated remaining time: 41h 15m 23s. Estimated total time: 50h 43m 1s. Time estimates for 10 more iterations: 10m 8s, 100 more iterations: 1h 41m 26s, 500 more iterations: 8h 27m 10s. [2025-11-13 07:37:20,051][__main__][INFO] - Starting iteration 595. [2025-11-13 07:37:20,532][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 07:37:20,532][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:37:34,232][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:37:41,850][__main__][INFO] - Number of regex retries in iteration 595: 1 [2025-11-13 07:37:41,850][__main__][INFO] - agents played in iteration 595 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:37:42,915][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:37:42,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:37:43,001][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:37:43,032][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:37:43,033][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:37:43,034][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:37:43,930][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:37:44,480][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:37:44,990][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:37:45,504][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:37:46,007][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:37:46,511][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:37:47,016][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:37:47,522][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:37:48,035][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:37:48,543][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:37:49,048][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:37:49,554][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:37:50,058][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:37:50,564][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:37:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:37:51,573][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:37:52,078][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:37:52,587][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:37:53,098][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:37:53,605][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:37:54,136][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:37:54,651][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:37:55,158][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:37:55,665][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:37:56,170][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:37:56,672][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:37:57,177][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:37:57,684][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:37:58,191][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:37:58,704][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:37:59,206][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:37:59,718][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:38:00,221][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:38:00,718][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:38:01,226][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:38:01,722][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:38:02,221][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:38:02,722][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:38:03,220][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:38:03,725][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:38:04,221][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:38:04,727][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:38:05,235][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:38:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:38:06,241][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:38:06,748][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:38:07,251][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:38:07,755][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:38:08,261][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:38:08,765][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:38:09,268][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:38:09,769][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:38:10,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:38:10,777][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:38:11,282][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:38:11,787][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:38:12,290][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:38:12,790][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:38:13,297][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:38:13,796][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:38:14,296][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:38:14,796][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:38:15,298][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:38:15,809][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:38:16,309][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9962 tokens. [2025-11-13 07:38:17,127][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 07:38:17,771][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:38:17,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:38:17,774][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:38:18,677][__main__][INFO] - Iteration 596 took 58s (36.66% Gen, 61.78% Train). Generation: 21s, Training: 35s. Estimated remaining time: 38h 58m 42s. Estimated total time: 48h 27m 19s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 54s, 500 more iterations: 8h 4m 33s. [2025-11-13 07:38:18,679][__main__][INFO] - Starting iteration 596. [2025-11-13 07:38:19,168][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 07:38:19,169][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:38:53,639][__main__][INFO] - Number of regex retries in iteration 596: 0 [2025-11-13 07:38:53,640][__main__][INFO] - agents played in iteration 596 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:38:54,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:38:54,521][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:38:54,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:38:54,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:38:54,567][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:38:54,567][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:38:55,396][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:38:55,993][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:38:56,503][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:38:57,008][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:38:57,515][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:38:58,022][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:38:58,526][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:38:59,034][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:38:59,543][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:39:00,049][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:39:00,562][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:39:01,065][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:39:01,566][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:39:02,073][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:39:02,578][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:39:03,075][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:39:03,579][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:39:04,077][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:39:04,579][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:39:05,079][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:39:05,579][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:39:06,078][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:39:06,578][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:39:07,079][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:39:07,580][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:39:08,085][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:39:08,593][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:39:09,097][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:39:09,597][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:39:10,107][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:39:10,608][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:39:11,117][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:39:11,618][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:39:12,118][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:39:12,626][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:39:13,127][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:39:13,627][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:39:14,134][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:39:14,633][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:39:15,145][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:39:15,640][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:39:16,142][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:39:16,647][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:39:17,154][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:39:17,664][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:39:18,169][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:39:18,668][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:39:19,173][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:39:19,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:39:20,181][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:39:20,691][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:39:21,195][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:39:21,699][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:39:22,206][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:39:22,714][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:39:23,229][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:39:23,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:39:24,231][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:39:24,731][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:39:25,232][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:39:25,748][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:39:26,258][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:39:26,761][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:39:27,266][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:39:27,771][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10046 tokens. [2025-11-13 07:39:28,630][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.54%, ΔTime: 00:00:33 [2025-11-13 07:39:29,385][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:39:29,387][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:39:29,389][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:39:30,332][__main__][INFO] - Iteration 597 took 1m 11s (48.44% Gen, 50.23% Train). Generation: 34s, Training: 35s. Estimated remaining time: 49h 48m 25s. Estimated total time: 59h 18m 14s. Time estimates for 10 more iterations: 11m 51s, 100 more iterations: 1h 58m 36s, 500 more iterations: 9h 53m 2s. [2025-11-13 07:39:30,334][__main__][INFO] - Starting iteration 597. [2025-11-13 07:39:30,829][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 07:39:30,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:39:46,127][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:39:57,726][__main__][INFO] - Number of regex retries in iteration 597: 1 [2025-11-13 07:39:57,727][__main__][INFO] - agents played in iteration 597 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:39:58,572][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:39:58,595][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:39:58,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:39:58,640][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:39:58,640][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:39:58,642][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:39:59,488][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:39:59,951][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:40:00,460][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:40:00,977][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:40:01,479][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:40:01,982][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:40:02,484][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:40:02,987][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:40:03,488][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:40:03,988][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:40:04,492][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:40:04,992][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:40:05,502][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:40:06,004][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:40:06,508][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:40:07,013][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:40:07,517][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:40:08,021][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:40:08,522][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:40:09,037][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:40:09,546][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:40:10,059][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:40:10,565][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:40:11,067][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:40:11,566][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:40:12,067][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:40:12,574][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:40:13,075][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:40:13,585][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:40:14,085][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:40:14,584][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:40:15,088][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:40:15,587][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:40:16,088][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:40:16,589][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:40:17,089][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:40:17,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:40:18,091][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:40:18,591][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:40:19,091][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:40:19,591][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:40:20,090][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:40:20,591][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:40:21,090][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:40:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:40:22,088][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:40:22,587][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:40:23,089][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:40:23,588][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:40:24,088][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:40:24,587][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:40:25,085][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:40:25,590][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:40:26,088][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:40:26,586][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:40:27,087][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:40:27,586][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:40:28,086][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:40:28,589][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:40:29,087][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:40:29,589][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:40:30,087][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:40:30,586][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:40:31,088][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:40:31,587][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10044 tokens. [2025-11-13 07:40:32,395][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.01%, Current % of VRAM taken: 58.26%, Block Peak % of device VRAM: 62.14%, ΔTime: 00:00:32 [2025-11-13 07:40:33,127][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:40:33,129][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:40:33,130][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:40:34,057][__main__][INFO] - Iteration 598 took 1m 3s (42.54% Gen, 55.99% Train). Generation: 26s, Training: 35s. Estimated remaining time: 43h 10m 34s. Estimated total time: 52h 41m 26s. Time estimates for 10 more iterations: 10m 32s, 100 more iterations: 1h 45m 22s, 500 more iterations: 8h 46m 54s. [2025-11-13 07:40:34,059][__main__][INFO] - Starting iteration 598. [2025-11-13 07:40:34,527][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 07:40:34,528][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:40:58,387][__main__][INFO] - Number of regex retries in iteration 598: 0 [2025-11-13 07:40:58,389][__main__][INFO] - agents played in iteration 598 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:40:59,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:40:59,273][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:40:59,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:40:59,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:40:59,322][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:40:59,323][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:41:00,150][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:41:00,615][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:41:01,129][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:41:01,636][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:41:02,143][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:41:02,650][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:41:03,160][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:41:03,667][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:41:04,180][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:41:04,684][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:41:05,195][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:41:05,709][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:41:06,217][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:41:06,724][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:41:07,235][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:41:07,745][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:41:08,258][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:41:08,761][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:41:09,267][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:41:09,774][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:41:10,277][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:41:10,780][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:41:11,284][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:41:11,803][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:41:12,306][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:41:12,808][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:41:13,311][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:41:13,813][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:41:14,315][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:41:14,818][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:41:15,322][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:41:15,826][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:41:16,331][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:41:16,834][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:41:17,330][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:41:17,834][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:41:18,331][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:41:18,832][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:41:19,333][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:41:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:41:20,337][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:41:20,838][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:41:21,343][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:41:21,849][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:41:22,382][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:41:22,890][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:41:23,398][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:41:23,907][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:41:24,414][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:41:24,920][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:41:25,425][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:41:25,929][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:41:26,435][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:41:26,935][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:41:27,438][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:41:27,937][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:41:28,442][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:41:28,944][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:41:29,444][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:41:29,944][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:41:30,447][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:41:30,951][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:41:31,453][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:41:31,955][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:41:32,461][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10208 tokens. [2025-11-13 07:41:33,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:33 [2025-11-13 07:41:33,895][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:41:33,896][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:41:33,898][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:41:34,716][__main__][INFO] - Iteration 599 took 1m 0s (39.64% Gen, 58.99% Train). Generation: 23s, Training: 35s. Estimated remaining time: 40h 37m 36s. Estimated total time: 50h 9m 30s. Time estimates for 10 more iterations: 10m 1s, 100 more iterations: 1h 40m 19s, 500 more iterations: 8h 21m 35s. [2025-11-13 07:41:34,718][__main__][INFO] - Starting iteration 599. [2025-11-13 07:41:35,215][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1. 
[2025-11-13 07:41:35,215][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:42:07,751][__main__][INFO] - Number of regex retries in iteration 599: 0 [2025-11-13 07:42:07,753][__main__][INFO] - agents played in iteration 599 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:42:08,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:42:08,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:42:08,646][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:42:08,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:42:08,671][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:42:08,671][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:42:09,495][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:42:09,954][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:42:10,460][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:42:10,978][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:42:11,488][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:42:11,999][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:42:12,504][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:42:13,013][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:42:13,525][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:42:14,022][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:42:14,526][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:42:15,026][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:42:15,528][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:42:16,030][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:42:16,524][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:42:17,027][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:42:17,533][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:42:18,035][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:42:18,535][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:42:19,037][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:42:19,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:42:20,044][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:42:20,545][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:42:21,047][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:42:21,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:42:22,051][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:42:22,552][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:42:23,059][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:42:23,561][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:42:24,066][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:42:24,570][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:42:25,072][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:42:25,590][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:42:26,095][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:42:26,603][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:42:27,106][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:42:27,612][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:42:28,120][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:42:28,624][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:42:29,130][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:42:29,632][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:42:30,132][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:42:30,631][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:42:31,130][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:42:31,625][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:42:32,130][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:42:32,630][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:42:33,131][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:42:33,626][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:42:34,125][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:42:34,628][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:42:35,127][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:42:35,630][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:42:36,135][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:42:36,634][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:42:37,133][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:42:37,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:42:38,141][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:42:38,645][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:42:39,147][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:42:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:42:40,148][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:42:40,651][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:42:41,154][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 07:42:41,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9952 tokens.
[2025-11-13 07:42:42,530][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33
[2025-11-13 07:42:43,266][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 07:42:43,268][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 07:42:43,269][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 07:42:44,211][__main__][INFO] - Iteration 600 took 1m 8s (47.16% Gen, 51.48% Train). Generation: 32s, Training: 35s. Estimated remaining time: 47h 56m 48s. Estimated total time: 57h 29m 51s. Time estimates for 10 more iterations: 11m 29s, 100 more iterations: 1h 54m 59s, 500 more iterations: 9h 34m 58s.
[2025-11-13 07:42:44,213][__main__][INFO] - Starting iteration 600.
[2025-11-13 07:42:44,697][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 59 and human policies 1.
[2025-11-13 07:42:44,698][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 07:43:13,820][__main__][INFO] - Number of regex retries in iteration 600: 0
[2025-11-13 07:43:13,821][__main__][INFO] - agents played in iteration 600 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 07:43:14,668][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:43:14,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:43:14,719][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:43:14,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:43:14,742][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 07:43:14,743][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 07:43:15,599][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:43:16,060][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:43:16,575][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:43:17,079][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:43:17,579][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:43:18,078][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:43:18,590][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:43:19,091][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:43:19,606][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:43:20,105][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:43:20,607][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:43:21,122][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:43:21,628][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:43:22,129][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:43:22,629][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:43:23,133][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:43:23,635][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:43:24,141][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:43:24,645][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:43:25,156][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:43:25,662][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:43:26,166][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:43:26,669][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:43:27,173][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:43:27,672][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:43:28,173][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:43:28,674][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:43:29,190][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:43:29,695][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:43:30,214][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:43:30,718][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:43:31,218][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:43:31,713][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:43:32,214][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:43:32,716][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:43:33,221][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:43:33,721][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:43:34,231][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:43:34,736][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:43:35,241][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:43:35,739][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:43:36,238][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:43:36,745][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:43:37,245][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:43:37,744][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:43:38,249][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:43:38,749][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:43:39,252][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:43:39,756][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:43:40,250][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:43:40,749][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:43:41,250][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:43:41,747][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:43:42,246][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:43:42,747][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:43:43,249][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:43:43,751][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:43:44,250][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:43:44,756][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:43:45,260][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:43:45,760][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:43:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:43:46,772][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:43:47,279][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 07:43:47,789][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10053 tokens.
[2025-11-13 07:43:48,681][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33
[2025-11-13 07:43:49,397][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 07:43:49,399][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 07:43:49,400][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 07:43:52,893][__main__][INFO] - Iteration 601 took 1m 8s (42.70% Gen, 52.17% Train). Generation: 29s, Training: 35s. Estimated remaining time: 47h 15m 39s. Estimated total time: 56h 49m 50s. Time estimates for 10 more iterations: 11m 21s, 100 more iterations: 1h 53m 39s, 500 more iterations: 9h 28m 18s.
[2025-11-13 07:43:52,897][__main__][INFO] - Starting iteration 601.
[2025-11-13 07:43:53,386][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 07:43:53,387][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 07:44:25,925][__main__][INFO] - Number of regex retries in iteration 601: 0
[2025-11-13 07:44:25,926][__main__][INFO] - agents played in iteration 601 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 07:44:26,804][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:44:26,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:44:26,864][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:44:26,886][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:44:26,887][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 07:44:26,888][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 07:44:27,658][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:44:28,112][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:44:28,611][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:44:29,114][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:44:29,630][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:44:30,126][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:44:30,627][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:44:31,129][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:44:31,626][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:44:32,136][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:44:32,641][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:44:33,142][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:44:33,647][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:44:34,150][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:44:34,655][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:44:35,156][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:44:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:44:36,153][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:44:36,657][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:44:37,160][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:44:37,664][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:44:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:44:38,667][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:44:39,171][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:44:39,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:44:40,224][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:44:40,728][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:44:41,235][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:44:41,741][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:44:42,245][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:44:42,747][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:44:43,248][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:44:43,750][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:44:44,251][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:44:44,749][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:44:45,243][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:44:45,744][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:44:46,242][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:44:46,745][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:44:47,245][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:44:47,747][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:44:48,251][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:44:48,752][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:44:49,255][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:44:49,760][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:44:50,269][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:44:50,774][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:44:51,276][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:44:51,780][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:44:52,283][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:44:52,787][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:44:53,293][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:44:53,795][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:44:54,299][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:44:54,813][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:44:55,315][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:44:55,825][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:44:56,325][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:44:56,834][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:44:57,348][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:44:57,855][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:44:58,366][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:44:58,875][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:44:59,378][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 07:44:59,878][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10036 tokens.
[2025-11-13 07:45:00,743][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.93%, Current % of VRAM taken: 58.18%, Block Peak % of device VRAM: 62.47%, ΔTime: 00:00:33
[2025-11-13 07:45:01,487][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 07:45:01,489][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 07:45:01,491][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 07:45:02,426][__main__][INFO] - Iteration 602 took 1m 9s (47.13% Gen, 51.51% Train). Generation: 32s, Training: 35s. Estimated remaining time: 47h 56m 40s. Estimated total time: 57h 32m 1s. Time estimates for 10 more iterations: 11m 30s, 100 more iterations: 1h 55m 4s, 500 more iterations: 9h 35m 20s.
[2025-11-13 07:45:02,428][__main__][INFO] - Starting iteration 602.
[2025-11-13 07:45:02,911][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 07:45:02,912][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 07:45:28,235][__main__][INFO] - Number of regex retries in iteration 602: 0
[2025-11-13 07:45:28,236][__main__][INFO] - agents played in iteration 602 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 07:45:29,163][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:45:29,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:45:29,217][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:45:29,241][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:45:29,242][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 07:45:29,243][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 07:45:30,031][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:45:30,482][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:45:30,988][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:45:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:45:31,996][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:45:32,501][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:45:33,000][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:45:33,503][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:45:34,007][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:45:34,509][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:45:35,012][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:45:35,643][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:45:36,172][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:45:36,675][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:45:37,175][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:45:37,677][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:45:38,181][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:45:38,683][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:45:39,182][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:45:39,681][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:45:40,186][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:45:40,690][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:45:41,188][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:45:41,686][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:45:42,186][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:45:42,691][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:45:43,196][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:45:43,701][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:45:44,205][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:45:44,706][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:45:45,210][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:45:45,715][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:45:46,244][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:45:46,745][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:45:47,246][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:45:47,746][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:45:48,247][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:45:48,762][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:45:49,267][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:45:49,768][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:45:50,268][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:45:50,767][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:45:51,270][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:45:51,768][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:45:52,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:45:52,772][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:45:53,270][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:45:53,769][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:45:54,268][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:45:54,771][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:45:55,281][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:45:55,784][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:45:56,289][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:45:56,792][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:45:57,297][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:45:57,807][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:45:58,313][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:45:58,818][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:45:59,321][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:45:59,824][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:46:00,331][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:46:00,832][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:46:01,333][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:46:01,850][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 07:46:02,355][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9948 tokens.
[2025-11-13 07:46:03,222][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.08%, ΔTime: 00:00:33
[2025-11-13 07:46:03,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 07:46:03,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 07:46:03,967][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 07:46:04,921][__main__][INFO] - Iteration 603 took 1m 2s (40.84% Gen, 57.62% Train). Generation: 25s, Training: 35s. Estimated remaining time: 42h 4m 8s. Estimated total time: 51h 40m 31s. Time estimates for 10 more iterations: 10m 20s, 100 more iterations: 1h 43m 21s, 500 more iterations: 8h 36m 45s.
[2025-11-13 07:46:04,924][__main__][INFO] - Starting iteration 603.
[2025-11-13 07:46:05,409][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1.
[2025-11-13 07:46:05,410][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 07:46:22,017][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3
[2025-11-13 07:46:27,750][__main__][INFO] - Number of regex retries in iteration 603: 1
[2025-11-13 07:46:27,751][__main__][INFO] - agents played in iteration 603 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 07:46:28,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:46:28,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:46:28,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:46:28,657][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 07:46:28,657][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 07:46:28,659][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 07:46:29,515][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:46:29,979][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:46:30,491][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:46:30,997][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:46:31,502][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:46:32,008][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:46:32,514][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:46:33,021][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:46:33,536][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:46:34,043][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:46:34,556][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:46:35,060][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:46:35,562][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:46:36,074][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:46:36,574][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:46:37,075][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:46:37,581][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:46:38,082][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:46:38,593][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:46:39,093][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:46:39,594][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:46:40,098][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:46:40,605][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:46:41,107][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:46:41,612][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:46:42,110][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:46:42,616][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:46:43,118][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:46:43,624][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:46:44,131][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:46:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:46:45,136][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:46:45,649][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:46:46,150][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:46:46,662][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:46:47,165][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:46:47,674][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:46:48,182][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:46:48,681][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:46:49,186][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:46:49,694][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:46:50,199][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:46:50,703][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:46:51,205][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:46:51,708][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:46:52,210][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:46:52,711][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:46:53,211][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:46:53,712][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:46:54,212][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:46:54,712][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:46:55,210][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:46:55,712][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:46:56,216][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:46:56,718][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:46:57,225][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:46:57,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:46:58,236][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:46:58,753][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:46:59,258][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:46:59,759][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:47:00,278][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:47:00,781][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:47:01,290][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:47:01,796][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10033 tokens. [2025-11-13 07:47:02,695][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.12%, ΔTime: 00:00:33 [2025-11-13 07:47:03,345][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:47:03,347][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:47:03,349][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:47:04,153][__main__][INFO] - Iteration 604 took 58s (38.03% Gen, 60.60% Train). Generation: 22s, Training: 35s. Estimated remaining time: 39h 19m 51s. Estimated total time: 48h 57m 13s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 54s, 500 more iterations: 8h 9m 32s. [2025-11-13 07:47:04,156][__main__][INFO] - Starting iteration 604. [2025-11-13 07:47:04,636][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 07:47:04,636][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:47:37,543][__main__][INFO] - Number of regex retries in iteration 604: 0 [2025-11-13 07:47:37,544][__main__][INFO] - agents played in iteration 604 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:47:38,405][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:47:38,435][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:47:38,462][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:47:38,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:47:38,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:47:38,487][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:47:39,308][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:47:39,778][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:47:40,286][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:47:40,793][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:47:41,294][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:47:41,797][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:47:42,303][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:47:42,798][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:47:43,299][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:47:43,807][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:47:44,312][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:47:44,812][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:47:45,311][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:47:45,813][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:47:46,322][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:47:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:47:47,325][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:47:47,835][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:47:48,336][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:47:48,839][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:47:49,344][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:47:49,846][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:47:50,349][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:47:50,854][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:47:51,358][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:47:51,855][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:47:52,358][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:47:52,874][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:47:53,372][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:47:53,871][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:47:54,370][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:47:54,871][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:47:55,370][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:47:55,869][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:47:56,371][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:47:56,887][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:47:57,388][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:47:57,891][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:47:58,383][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:47:58,880][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:47:59,379][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:47:59,871][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:48:00,372][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:48:00,884][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:48:01,386][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:48:01,887][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:48:02,382][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:48:02,878][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:48:03,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:48:03,884][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:48:04,387][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:48:04,891][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:48:05,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:48:05,894][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:48:06,396][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:48:06,899][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:48:07,403][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:48:07,907][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:48:08,405][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:48:08,911][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:48:09,417][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:48:09,926][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:48:10,428][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:48:10,932][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:48:11,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10011 tokens. [2025-11-13 07:48:12,332][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:33 [2025-11-13 07:48:13,103][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:48:13,105][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:48:13,107][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:48:14,008][__main__][INFO] - Iteration 605 took 1m 9s (47.44% Gen, 51.26% Train). Generation: 32s, Training: 35s. Estimated remaining time: 48h 10m 7s. Estimated total time: 57h 48m 40s. Time estimates for 10 more iterations: 11m 33s, 100 more iterations: 1h 55m 37s, 500 more iterations: 9h 38m 6s. [2025-11-13 07:48:14,011][__main__][INFO] - Starting iteration 605. [2025-11-13 07:48:14,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 07:48:14,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:48:29,771][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:48:40,123][__main__][INFO] - Number of regex retries in iteration 605: 1 [2025-11-13 07:48:40,124][__main__][INFO] - agents played in iteration 605 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:48:40,962][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:48:40,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:48:41,006][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:48:41,028][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:48:41,029][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:48:41,030][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:48:41,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:48:42,260][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:48:42,765][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:48:43,271][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:48:43,773][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:48:44,284][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:48:44,788][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:48:45,295][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:48:45,790][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:48:46,293][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:48:46,821][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:48:47,329][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:48:47,835][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:48:48,350][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:48:48,859][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:48:49,366][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:48:49,866][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:48:50,369][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:48:50,886][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:48:51,390][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:48:51,891][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:48:52,391][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:48:52,894][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:48:53,401][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:48:53,906][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:48:54,408][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:48:54,910][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:48:55,411][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:48:55,914][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:48:56,412][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:48:56,907][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:48:57,410][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:48:57,914][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:48:58,414][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:48:58,912][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:48:59,407][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:48:59,910][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:49:00,410][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:49:00,908][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:49:01,412][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:49:01,914][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:49:02,418][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:49:02,922][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:49:03,425][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:49:03,932][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:49:04,436][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:49:04,938][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:49:05,438][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:49:05,938][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:49:06,446][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:49:06,945][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:49:07,448][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:49:07,944][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:49:08,450][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:49:08,950][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:49:09,464][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:49:09,968][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:49:10,469][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:49:10,974][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:49:11,481][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:49:11,987][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:49:12,495][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:49:13,002][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:49:13,508][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:49:14,009][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9944 tokens. [2025-11-13 07:49:14,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.98%, Current % of VRAM taken: 58.23%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 07:49:15,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:49:15,629][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:49:15,631][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:49:16,549][__main__][INFO] - Iteration 606 took 1m 2s (41.31% Gen, 57.21% Train). Generation: 25s, Training: 35s. Estimated remaining time: 42h 3m 36s. Estimated total time: 51h 43m 11s. Time estimates for 10 more iterations: 10m 20s, 100 more iterations: 1h 43m 26s, 500 more iterations: 8h 37m 11s. [2025-11-13 07:49:16,551][__main__][INFO] - Starting iteration 606. [2025-11-13 07:49:17,043][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 07:49:17,043][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:49:38,437][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:49:46,295][__main__][INFO] - Number of regex retries in iteration 606: 1 [2025-11-13 07:49:46,296][__main__][INFO] - agents played in iteration 606 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:49:47,124][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:49:47,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:49:47,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:49:47,197][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:49:47,198][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:49:47,199][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:49:48,067][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:49:48,529][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:49:49,037][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:49:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:49:50,039][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:49:50,548][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:49:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:49:51,553][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:49:52,054][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:49:52,555][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:49:53,066][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:49:53,567][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:49:54,073][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:49:54,575][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:49:55,077][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:49:55,582][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:49:56,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:49:56,588][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:49:57,089][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:49:57,591][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:49:58,108][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:49:58,611][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:49:59,115][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:49:59,618][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:50:00,122][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:50:00,632][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:50:01,136][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:50:01,638][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:50:02,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:50:02,651][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:50:03,171][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:50:03,675][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:50:04,177][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:50:04,672][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:50:05,168][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:50:05,668][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:50:06,176][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:50:06,674][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:50:07,189][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:50:07,690][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:50:08,193][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:50:08,690][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:50:09,192][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:50:09,695][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:50:10,189][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:50:10,689][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:50:11,202][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:50:11,708][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:50:12,218][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:50:12,722][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:50:13,227][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:50:13,734][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:50:14,242][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:50:14,750][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:50:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:50:15,763][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:50:16,294][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:50:16,804][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:50:17,310][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:50:17,813][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:50:18,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:50:18,823][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:50:19,332][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:50:19,839][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:50:20,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10043 tokens. [2025-11-13 07:50:21,245][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 07:50:22,001][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:50:22,003][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:50:22,005][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:50:22,920][__main__][INFO] - Iteration 607 took 1m 5s (44.40% Gen, 54.21% Train). Generation: 29s, Training: 35s. Estimated remaining time: 45h 13m 12s. Estimated total time: 54h 53m 53s. Time estimates for 10 more iterations: 10m 58s, 100 more iterations: 1h 49m 47s, 500 more iterations: 9h 8m 58s. [2025-11-13 07:50:22,922][__main__][INFO] - Starting iteration 607. [2025-11-13 07:50:23,410][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 07:50:23,411][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:50:43,176][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 1 z ball did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:50:47,497][__main__][INFO] - Number of regex retries in iteration 607: 1 [2025-11-13 07:50:47,498][__main__][INFO] - agents played in iteration 607 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:50:48,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:50:48,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:50:48,359][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:50:48,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:50:48,384][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:50:48,385][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:50:49,246][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:50:49,718][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:50:50,225][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:50:50,731][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:50:51,240][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:50:51,746][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:50:52,253][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:50:52,759][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:50:53,266][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:50:53,781][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:50:54,282][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:50:54,799][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:50:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:50:55,788][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:50:56,300][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:50:56,800][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:50:57,300][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:50:57,801][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:50:58,302][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:50:58,814][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:50:59,315][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:50:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:51:00,319][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:51:00,821][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:51:01,325][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:51:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:51:02,326][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:51:02,827][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:51:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:51:03,833][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:51:04,334][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:51:04,836][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:51:05,340][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:51:05,850][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:51:06,355][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:51:06,864][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:51:07,371][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:51:07,872][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:51:08,374][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:51:08,879][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:51:09,375][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:51:09,880][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:51:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:51:10,883][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:51:11,385][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:51:11,886][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:51:12,387][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:51:12,887][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:51:13,401][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:51:13,904][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:51:14,407][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:51:14,910][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:51:15,410][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:51:15,920][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:51:16,419][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:51:16,926][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:51:17,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:51:17,932][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:51:18,431][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:51:18,935][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:51:19,438][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:51:19,945][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:51:20,450][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:51:20,957][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:51:21,460][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10070 tokens. [2025-11-13 07:51:22,322][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:33 [2025-11-13 07:51:23,076][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:51:23,078][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:51:23,080][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:51:23,978][__main__][INFO] - Iteration 608 took 1m 0s (39.77% Gen, 58.75% Train). Generation: 24s, Training: 35s. Estimated remaining time: 40h 46m 43s. Estimated total time: 50h 28m 25s. Time estimates for 10 more iterations: 10m 5s, 100 more iterations: 1h 40m 56s, 500 more iterations: 8h 24m 44s. [2025-11-13 07:51:23,980][__main__][INFO] - Starting iteration 608. [2025-11-13 07:51:24,460][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 07:51:24,461][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:51:39,772][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:51:48,534][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:51:51,113][__main__][INFO] - Number of regex retries in iteration 608: 2 [2025-11-13 07:51:51,114][__main__][INFO] - agents played in iteration 608 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:51:51,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:51:52,009][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:51:52,034][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:51:52,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:51:52,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:51:52,058][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:51:52,928][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:51:53,388][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:51:53,892][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:51:54,408][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:51:54,913][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:51:55,427][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:51:55,931][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:51:56,433][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:51:56,944][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:51:57,453][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:51:57,965][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:51:58,468][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:51:58,973][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:51:59,480][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:51:59,984][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:52:00,486][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:52:00,998][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:52:01,534][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:52:02,043][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:52:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:52:03,052][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:52:03,556][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:52:04,058][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:52:04,561][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:52:05,065][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:52:05,566][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:52:06,070][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:52:06,572][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:52:07,073][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:52:07,578][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:52:08,087][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:52:08,590][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:52:09,093][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:52:09,595][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:52:10,107][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:52:10,607][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:52:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:52:11,618][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:52:12,118][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:52:12,632][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:52:13,130][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:52:13,637][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:52:14,153][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:52:14,657][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:52:15,168][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:52:15,669][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:52:16,173][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:52:16,676][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:52:17,177][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:52:17,677][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:52:18,176][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:52:18,676][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:52:19,177][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:52:19,677][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:52:20,178][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:52:20,680][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:52:21,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:52:21,681][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:52:22,181][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:52:22,681][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:52:23,183][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:52:23,684][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:52:24,189][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:52:24,702][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:52:25,207][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9985 tokens. [2025-11-13 07:52:26,058][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 07:52:26,685][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:52:26,687][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:52:26,689][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:52:27,489][__main__][INFO] - Iteration 609 took 1m 3s (42.29% Gen, 56.44% Train). Generation: 26s, Training: 35s. Estimated remaining time: 42h 48m 42s. Estimated total time: 52h 31m 28s. Time estimates for 10 more iterations: 10m 30s, 100 more iterations: 1h 45m 2s, 500 more iterations: 8h 45m 14s. [2025-11-13 07:52:27,491][__main__][INFO] - Starting iteration 609. [2025-11-13 07:52:27,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 07:52:27,976][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:52:58,683][__main__][INFO] - Number of regex retries in iteration 609: 0 [2025-11-13 07:52:58,684][__main__][INFO] - agents played in iteration 609 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:52:59,540][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:52:59,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:52:59,589][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:52:59,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:52:59,613][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:52:59,614][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:53:00,453][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:53:00,926][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:53:01,438][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:53:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:53:02,456][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:53:02,963][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:53:03,471][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:53:03,975][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:53:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:53:04,980][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:53:05,481][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:53:05,982][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:53:06,487][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:53:06,987][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:53:07,490][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:53:07,991][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:53:08,495][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:53:09,012][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:53:09,513][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:53:10,025][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:53:10,528][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:53:11,032][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:53:11,536][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:53:12,040][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:53:12,541][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:53:13,045][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:53:13,550][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:53:14,057][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:53:14,556][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:53:15,059][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:53:15,564][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:53:16,071][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:53:16,571][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:53:17,074][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:53:17,569][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:53:18,074][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:53:18,580][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:53:19,072][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:53:19,570][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:53:20,072][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:53:20,580][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:53:21,079][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:53:21,575][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:53:22,074][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:53:22,578][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:53:23,081][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:53:23,583][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:53:24,076][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:53:24,576][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:53:25,079][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:53:25,581][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:53:26,087][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:53:26,590][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:53:27,092][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:53:27,594][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:53:28,096][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:53:28,604][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:53:29,108][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:53:29,607][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:53:30,123][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:53:30,628][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:53:31,147][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:53:31,645][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:53:32,148][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:53:32,660][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10197 tokens. [2025-11-13 07:53:33,534][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 07:53:34,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:53:34,263][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:53:34,265][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:53:35,328][__main__][INFO] - Iteration 610 took 1m 7s (45.59% Gen, 52.83% Train). Generation: 30s, Training: 35s. Estimated remaining time: 46h 23m 46s. Estimated total time: 56h 7m 40s. Time estimates for 10 more iterations: 11m 13s, 100 more iterations: 1h 52m 15s, 500 more iterations: 9h 21m 16s. [2025-11-13 07:53:35,331][__main__][INFO] - Starting iteration 610. [2025-11-13 07:53:36,069][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 60 and human policies 1. 
[2025-11-13 07:53:36,070][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:54:07,190][__main__][INFO] - Number of regex retries in iteration 610: 0 [2025-11-13 07:54:07,190][__main__][INFO] - agents played in iteration 610 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:54:08,050][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:54:08,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:54:08,098][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:54:08,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:54:08,121][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:54:08,122][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:54:08,991][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:54:09,459][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:54:09,970][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:54:10,472][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:54:10,975][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:54:11,477][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:54:11,981][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:54:12,488][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:54:12,993][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:54:13,496][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:54:13,999][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:54:14,507][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:54:15,012][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:54:15,515][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:54:16,038][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:54:16,540][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:54:17,046][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:54:17,547][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:54:18,048][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:54:18,558][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:54:19,062][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:54:19,565][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:54:20,066][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:54:20,566][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:54:21,073][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:54:21,576][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:54:22,084][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:54:22,590][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:54:23,092][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:54:23,591][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:54:24,091][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:54:24,594][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:54:25,095][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:54:25,595][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:54:26,094][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:54:26,595][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:54:27,094][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:54:27,594][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:54:28,095][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:54:28,596][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:54:29,099][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:54:29,596][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:54:30,096][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:54:30,596][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:54:31,096][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:54:31,601][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:54:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:54:32,599][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:54:33,094][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:54:33,602][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:54:34,108][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:54:34,621][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:54:35,127][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:54:35,681][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:54:36,190][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:54:36,699][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:54:38,101][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:54:38,707][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:54:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:54:39,731][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:54:40,237][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:54:40,737][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:54:41,241][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:54:41,746][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:54:42,257][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10029 tokens. [2025-11-13 07:54:43,151][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:34 [2025-11-13 07:54:43,822][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:54:43,824][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:54:43,826][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:54:45,560][__main__][INFO] - Iteration 611 took 1m 9s (44.78% Gen, 52.72% Train). Generation: 31s, Training: 36s. Estimated remaining time: 48h 9m 32s. Estimated total time: 57h 54m 36s. Time estimates for 10 more iterations: 11m 34s, 100 more iterations: 1h 55m 49s, 500 more iterations: 9h 39m 6s. [2025-11-13 07:54:45,594][__main__][INFO] - Starting iteration 611. [2025-11-13 07:54:46,090][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 07:54:46,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:55:17,666][__main__][INFO] - Number of regex retries in iteration 611: 0 [2025-11-13 07:55:17,667][__main__][INFO] - agents played in iteration 611 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:55:18,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:55:18,514][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:55:18,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:55:18,564][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:55:18,565][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:55:18,566][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:55:19,341][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:55:19,797][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:55:20,305][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:55:20,806][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:55:21,308][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:55:21,815][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:55:22,315][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:55:22,816][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:55:23,324][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:55:23,825][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:55:24,337][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:55:24,844][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:55:25,345][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:55:25,849][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:55:26,353][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:55:26,863][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:55:27,366][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:55:27,867][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:55:28,375][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:55:28,881][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:55:29,384][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:55:29,889][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:55:30,393][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:55:30,899][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:55:31,402][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:55:31,900][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:55:32,402][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:55:32,906][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:55:33,409][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:55:33,912][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:55:34,407][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:55:34,909][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:55:35,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:55:35,913][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:55:36,416][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:55:36,915][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:55:37,415][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:55:37,919][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:55:38,417][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:55:38,926][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:55:39,430][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:55:39,933][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:55:40,446][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:55:40,952][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:55:41,464][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:55:41,968][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:55:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:55:42,978][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:55:43,484][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:55:43,990][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:55:44,495][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:55:45,003][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:55:45,511][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:55:46,018][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:55:46,525][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:55:47,028][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:55:47,534][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:55:48,062][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:55:48,573][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:55:49,082][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:55:49,590][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:55:50,115][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:55:50,628][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:55:51,137][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:55:51,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10166 tokens. [2025-11-13 07:55:52,529][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 07:55:53,317][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:55:53,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:55:53,322][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:55:54,230][__main__][INFO] - Iteration 612 took 1m 8s (46.34% Gen, 52.33% Train). Generation: 31s, Training: 35s. Estimated remaining time: 47h 0m 47s. Estimated total time: 56h 47m 0s. Time estimates for 10 more iterations: 11m 21s, 100 more iterations: 1h 53m 34s, 500 more iterations: 9h 27m 50s. [2025-11-13 07:55:54,232][__main__][INFO] - Starting iteration 612. [2025-11-13 07:55:54,715][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 07:55:54,715][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:56:18,459][__main__][INFO] - Number of regex retries in iteration 612: 0 [2025-11-13 07:56:18,459][__main__][INFO] - agents played in iteration 612 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:56:19,338][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:56:19,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:56:19,398][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:56:19,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:56:19,422][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:56:19,423][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:56:20,211][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:56:20,670][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:56:21,175][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:56:21,676][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:56:22,190][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:56:22,694][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:56:23,198][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:56:23,703][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:56:24,211][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:56:24,716][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:56:25,220][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:56:25,724][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:56:26,232][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:56:26,735][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:56:27,247][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:56:27,759][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:56:28,262][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:56:28,774][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:56:29,271][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:56:29,765][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:56:30,270][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:56:30,773][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:56:31,273][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:56:31,775][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:56:32,279][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:56:32,788][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:56:33,288][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:56:33,789][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:56:34,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:56:34,788][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:56:35,290][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:56:35,788][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:56:36,288][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:56:36,788][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:56:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:56:37,813][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:56:38,319][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:56:38,824][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:56:39,326][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:56:39,829][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:56:40,330][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:56:40,833][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:56:41,333][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:56:41,838][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:56:42,346][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:56:42,854][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:56:43,359][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:56:43,862][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:56:44,367][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:56:44,872][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:56:45,377][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:56:45,897][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:56:46,402][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:56:46,912][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:56:47,413][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:56:47,921][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:56:48,428][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:56:48,935][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:56:49,442][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:56:49,942][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:56:50,457][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:56:50,962][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:56:51,477][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:56:51,985][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:56:52,490][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10052 tokens. [2025-11-13 07:56:53,344][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 07:56:54,117][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:56:54,118][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:56:54,120][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:56:55,261][__main__][INFO] - Iteration 613 took 1m 0s (39.21% Gen, 58.90% Train). Generation: 23s, Training: 35s. Estimated remaining time: 40h 40m 7s. Estimated total time: 50h 27m 20s. Time estimates for 10 more iterations: 10m 5s, 100 more iterations: 1h 40m 54s, 500 more iterations: 8h 24m 33s. [2025-11-13 07:56:55,263][__main__][INFO] - Starting iteration 613. [2025-11-13 07:56:55,878][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 07:56:55,879][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:57:04,798][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:57:15,575][__main__][INFO] - Number of regex retries in iteration 613: 1 [2025-11-13 07:57:15,576][__main__][INFO] - agents played in iteration 613 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:57:16,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:57:16,594][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:57:16,620][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:57:16,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:57:16,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:57:16,643][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:57:17,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:57:17,980][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:57:18,490][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:57:19,003][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:57:19,512][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:57:20,019][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:57:20,526][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:57:21,030][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:57:21,533][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:57:22,034][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:57:22,540][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:57:23,044][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:57:23,551][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:57:24,052][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:57:24,569][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:57:25,070][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:57:25,571][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:57:26,075][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:57:26,575][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:57:27,085][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:57:27,586][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:57:28,092][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:57:28,595][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:57:29,095][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:57:29,600][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:57:30,107][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:57:30,614][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:57:31,120][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:57:31,624][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:57:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:57:32,633][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:57:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:57:33,647][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:57:34,146][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:57:34,646][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:57:35,166][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:57:35,664][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:57:36,162][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:57:36,659][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:57:37,157][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:57:37,669][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:57:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:57:38,668][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:57:39,170][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:57:39,668][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:57:40,173][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:57:40,672][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:57:41,171][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:57:41,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:57:42,180][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:57:42,679][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:57:43,181][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:57:43,681][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:57:44,193][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:57:44,699][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:57:45,207][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:57:45,711][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:57:46,984][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:57:47,485][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:57:47,982][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:57:48,485][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:57:48,984][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:57:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:57:50,009][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:57:50,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10085 tokens. [2025-11-13 07:57:51,419][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 07:57:52,143][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:57:52,145][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:57:52,146][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:57:52,965][__main__][INFO] - Iteration 614 took 57s (34.50% Gen, 64.06% Train). Generation: 19s, Training: 36s. Estimated remaining time: 37h 46m 10s. Estimated total time: 47h 34m 22s. Time estimates for 10 more iterations: 9m 30s, 100 more iterations: 1h 35m 8s, 500 more iterations: 7h 55m 43s. [2025-11-13 07:57:52,967][__main__][INFO] - Starting iteration 614. [2025-11-13 07:57:53,470][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 07:57:53,471][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:58:20,186][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 11 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 07:58:29,356][__main__][INFO] - Number of regex retries in iteration 614: 1 [2025-11-13 07:58:29,357][__main__][INFO] - agents played in iteration 614 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:58:30,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:58:30,200][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:58:30,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:58:30,259][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:58:30,260][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:58:30,260][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:58:31,071][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:58:31,531][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:58:32,042][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:58:32,546][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:58:33,051][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:58:33,562][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:58:34,064][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:58:34,576][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:58:35,076][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:58:35,586][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:58:36,101][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:58:36,606][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:58:37,125][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:58:37,629][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:58:38,136][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:58:38,638][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:58:39,146][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:58:39,651][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:58:40,157][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:58:40,662][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:58:41,176][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:58:41,686][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:58:42,194][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:58:42,694][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:58:43,198][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:58:43,694][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:58:44,195][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:58:44,692][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:58:45,194][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:58:45,690][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:58:46,210][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:58:46,710][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:58:47,208][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:58:47,713][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:58:48,207][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:58:48,712][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:58:49,213][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:58:49,713][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:58:50,229][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:58:50,732][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:58:51,234][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:58:51,738][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:58:52,238][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:58:52,739][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:58:53,236][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:58:53,733][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:58:54,243][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:58:54,751][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:58:55,256][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:58:55,762][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:58:56,266][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:58:56,772][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:58:57,275][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:58:57,780][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:58:58,286][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:58:58,796][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:58:59,304][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:58:59,812][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:59:00,320][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:59:00,840][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:59:01,351][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 07:59:01,861][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 07:59:02,364][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
07:59:02,872][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 07:59:03,386][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10060 tokens. [2025-11-13 07:59:04,222][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 07:59:04,999][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 07:59:05,000][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 07:59:05,003][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 07:59:05,984][__main__][INFO] - Iteration 615 took 1m 12s (49.49% Gen, 49.16% Train). Generation: 35s, Training: 35s. Estimated remaining time: 50h 36m 18s. Estimated total time: 60h 25m 43s. Time estimates for 10 more iterations: 12m 5s, 100 more iterations: 2h 0m 51s, 500 more iterations: 10h 4m 17s. [2025-11-13 07:59:05,986][__main__][INFO] - Starting iteration 615. [2025-11-13 07:59:06,489][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 07:59:06,490][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 07:59:28,057][__main__][INFO] - Number of regex retries in iteration 615: 0 [2025-11-13 07:59:28,058][__main__][INFO] - agents played in iteration 615 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 07:59:28,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:59:28,969][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:59:28,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:59:29,019][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 07:59:29,020][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 07:59:29,021][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 07:59:29,885][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 07:59:30,346][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 07:59:30,855][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 07:59:31,360][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 07:59:31,862][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 07:59:32,362][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 07:59:32,859][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 07:59:33,361][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 07:59:33,864][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 07:59:34,365][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 07:59:34,865][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 07:59:35,373][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 07:59:35,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 07:59:36,376][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 07:59:36,885][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 07:59:37,387][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 07:59:37,890][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 07:59:38,387][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 07:59:38,888][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 07:59:39,393][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 07:59:39,891][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
07:59:40,390][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 07:59:40,896][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 07:59:41,391][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 07:59:41,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 07:59:42,387][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 07:59:42,891][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 07:59:43,404][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 07:59:43,909][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 07:59:44,411][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 07:59:44,911][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 07:59:45,410][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 07:59:45,910][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 07:59:46,410][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 07:59:46,909][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 07:59:47,414][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 07:59:47,912][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 07:59:48,424][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 07:59:48,923][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 07:59:49,423][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 07:59:49,945][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 07:59:50,445][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
07:59:50,943][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 07:59:51,444][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 07:59:51,935][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 07:59:52,444][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 07:59:52,941][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 07:59:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 07:59:53,951][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 07:59:54,451][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 07:59:54,952][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 07:59:55,449][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 07:59:55,949][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 07:59:56,456][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 07:59:56,954][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 07:59:57,453][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 07:59:57,960][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 07:59:58,458][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 07:59:58,959][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 07:59:59,458][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 07:59:59,962][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:00:00,469][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:00:00,972][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:00:01,476][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:00:01,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10003 tokens. [2025-11-13 08:00:02,814][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.14%, ΔTime: 00:00:32 [2025-11-13 08:00:03,592][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:00:03,598][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:00:03,600][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:00:04,597][__main__][INFO] - Iteration 616 took 58s (37.12% Gen, 61.17% Train). Generation: 21s, Training: 35s. Estimated remaining time: 38h 35m 1s. Estimated total time: 48h 25m 24s. Time estimates for 10 more iterations: 9m 41s, 100 more iterations: 1h 36m 50s, 500 more iterations: 8h 4m 14s. [2025-11-13 08:00:04,599][__main__][INFO] - Starting iteration 616. [2025-11-13 08:00:05,070][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 08:00:05,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:00:16,122][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:00:28,252][__main__][INFO] - Number of regex retries in iteration 616: 1 [2025-11-13 08:00:28,254][__main__][INFO] - agents played in iteration 616 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:00:29,106][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:00:29,129][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:00:29,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:00:29,175][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:00:29,175][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:00:29,176][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:00:30,011][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:00:30,474][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:00:30,984][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:00:31,491][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:00:32,001][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:00:32,505][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:00:33,010][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:00:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:00:34,035][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:00:34,547][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:00:35,058][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:00:35,562][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:00:36,068][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:00:36,577][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:00:37,079][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:00:37,587][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:00:38,092][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:00:38,602][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:00:39,104][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:00:39,606][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:00:40,117][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:00:40,622][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:00:41,130][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:00:41,632][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:00:42,136][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:00:42,639][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:00:43,141][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:00:43,645][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:00:44,149][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:00:44,654][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:00:45,161][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:00:45,662][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:00:46,166][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:00:46,669][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:00:47,177][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:00:47,687][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:00:48,190][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:00:48,695][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:00:49,201][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:00:49,707][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:00:50,210][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:00:50,711][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:00:51,215][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:00:51,719][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:00:52,222][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:00:52,728][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:00:53,230][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:00:53,730][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:00:54,246][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:00:54,748][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:00:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:00:55,753][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:00:56,252][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:00:56,755][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:00:57,257][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:00:57,761][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:00:58,267][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:00:58,771][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:00:59,283][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:00:59,790][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:01:00,295][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:01:00,802][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:01:01,305][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:01:01,811][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:01:02,314][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10140 tokens. [2025-11-13 08:01:03,207][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.02%, Current % of VRAM taken: 58.27%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:33 [2025-11-13 08:01:03,873][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:01:03,875][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:01:03,878][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:01:04,693][__main__][INFO] - Iteration 617 took 59s (38.88% Gen, 59.75% Train). Generation: 23s, Training: 35s. Estimated remaining time: 39h 49m 49s. Estimated total time: 49h 41m 12s. Time estimates for 10 more iterations: 9m 56s, 100 more iterations: 1h 39m 22s, 500 more iterations: 8h 16m 52s. [2025-11-13 08:01:04,696][__main__][INFO] - Starting iteration 617. [2025-11-13 08:01:05,182][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 08:01:05,182][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:01:40,145][__main__][INFO] - Number of regex retries in iteration 617: 0 [2025-11-13 08:01:40,145][__main__][INFO] - agents played in iteration 617 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:01:40,992][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:01:41,018][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:01:41,043][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:01:41,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:01:41,066][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:01:41,066][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:01:41,980][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:01:42,443][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:01:42,967][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:01:43,468][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:01:43,972][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:01:44,474][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:01:44,977][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:01:45,479][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:01:45,981][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:01:46,490][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:01:46,998][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:01:47,504][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:01:48,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:01:48,514][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:01:49,018][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:01:49,522][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:01:50,024][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:01:50,534][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:01:51,044][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:01:51,552][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:01:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:01:52,563][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:01:53,071][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:01:53,573][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:01:54,083][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:01:54,612][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:01:55,117][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:01:55,620][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:01:56,122][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:01:56,621][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:01:57,127][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:01:57,623][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:01:58,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:01:58,620][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:01:59,120][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:01:59,622][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:02:00,130][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:02:00,632][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:02:01,145][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:02:01,645][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:02:02,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:02:02,665][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:02:03,164][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:02:03,661][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:02:04,159][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:02:04,659][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:02:05,166][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:02:05,665][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:02:06,164][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:02:06,671][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:02:07,172][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:02:07,673][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:02:08,173][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:02:08,673][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:02:09,172][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:02:09,673][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:02:10,175][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:02:10,673][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:02:11,175][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:02:11,674][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:02:12,172][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:02:12,671][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:02:13,184][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:02:13,683][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:02:14,184][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10067 tokens. [2025-11-13 08:02:14,999][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 08:02:15,764][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:02:15,765][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:02:15,767][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:02:16,739][__main__][INFO] - Iteration 618 took 1m 11s (48.86% Gen, 49.78% Train). Generation: 34s, Training: 35s. Estimated remaining time: 49h 45m 18s. Estimated total time: 59h 37m 53s. Time estimates for 10 more iterations: 11m 55s, 100 more iterations: 1h 59m 15s, 500 more iterations: 9h 56m 18s. [2025-11-13 08:02:16,741][__main__][INFO] - Starting iteration 618. [2025-11-13 08:02:17,249][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 08:02:17,250][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:02:42,710][__main__][INFO] - Number of regex retries in iteration 618: 0 [2025-11-13 08:02:42,710][__main__][INFO] - agents played in iteration 618 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:02:43,524][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:02:43,546][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:02:43,568][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:02:43,590][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:02:43,591][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:02:43,592][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:02:44,470][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:02:44,927][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:02:45,439][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:02:45,958][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:02:46,469][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:02:46,973][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:02:47,476][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:02:47,977][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:02:48,484][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:02:48,984][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:02:49,485][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:02:49,988][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:02:50,490][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:02:50,991][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:02:51,493][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:02:51,994][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:02:52,498][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:02:53,001][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:02:53,505][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:02:54,013][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:02:54,520][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:02:55,025][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:02:55,531][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:02:56,036][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:02:56,534][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:02:57,068][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:02:57,579][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:02:58,080][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:02:58,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:02:59,082][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:02:59,588][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:03:00,095][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:03:00,597][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:03:01,096][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:03:01,596][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:03:02,099][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:03:02,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:03:03,098][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:03:03,598][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:03:04,098][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:03:04,601][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:03:05,106][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:03:05,609][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:03:06,112][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:03:06,613][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:03:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:03:07,616][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:03:08,121][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:03:08,624][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:03:09,123][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:03:09,623][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:03:10,120][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:03:10,618][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:03:11,114][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:03:11,612][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:03:12,110][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:03:12,607][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:03:13,108][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:03:13,610][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:03:14,110][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:03:14,614][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:03:15,113][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:03:15,607][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:03:16,111][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:03:16,616][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10076 tokens. [2025-11-13 08:03:17,405][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.40%, ΔTime: 00:00:32 [2025-11-13 08:03:18,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:03:18,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:03:18,139][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:03:19,035][__main__][INFO] - Iteration 619 took 1m 1s (41.21% Gen, 57.34% Train). Generation: 25s, Training: 35s. Estimated remaining time: 41h 35m 43s. Estimated total time: 51h 29m 21s. Time estimates for 10 more iterations: 10m 17s, 100 more iterations: 1h 42m 58s, 500 more iterations: 8h 34m 53s. [2025-11-13 08:03:19,038][__main__][INFO] - Starting iteration 619. [2025-11-13 08:03:19,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 08:03:19,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:03:47,319][__main__][INFO] - Number of regex retries in iteration 619: 0 [2025-11-13 08:03:47,322][__main__][INFO] - agents played in iteration 619 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:03:48,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:03:48,242][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:03:48,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:03:48,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:03:48,301][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:03:48,302][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:03:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:03:49,654][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:03:50,162][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:03:50,674][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:03:51,176][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:03:51,699][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:03:52,205][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:03:52,712][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:03:53,221][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:03:53,732][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:03:54,239][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:03:54,745][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:03:55,253][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:03:55,766][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:03:56,268][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:03:56,785][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:03:57,289][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:03:57,784][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:03:58,291][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:03:58,794][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:03:59,303][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:03:59,806][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:04:00,314][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:04:00,819][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:04:01,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:04:01,826][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:04:02,327][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:04:02,836][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:04:03,338][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:04:03,834][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:04:04,333][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:04:04,835][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:04:05,336][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:04:05,844][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:04:06,345][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:04:06,848][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:04:07,348][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:04:07,848][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:04:08,351][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:04:08,868][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:04:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:04:09,881][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:04:10,384][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:04:10,884][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:04:11,390][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:04:11,893][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:04:12,398][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:04:12,900][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:04:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:04:13,935][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:04:14,442][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:04:14,946][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:04:15,451][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:04:15,955][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:04:16,459][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:04:16,962][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:04:17,472][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:04:17,992][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:04:18,492][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:04:18,992][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:04:19,494][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:04:19,994][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:04:20,506][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:04:21,011][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:04:21,524][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10192 tokens. [2025-11-13 08:04:22,400][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.51%, Block Peak % of device VRAM: 62.44%, ΔTime: 00:00:33 [2025-11-13 08:04:23,124][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:04:23,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:04:23,127][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:04:24,110][__main__][INFO] - Iteration 620 took 1m 4s (43.06% Gen, 55.42% Train). Generation: 27s, Training: 35s. Estimated remaining time: 43h 55m 33s. Estimated total time: 53h 50m 16s. Time estimates for 10 more iterations: 10m 46s, 100 more iterations: 1h 47m 40s, 500 more iterations: 8h 58m 22s. [2025-11-13 08:04:24,112][__main__][INFO] - Starting iteration 620. [2025-11-13 08:04:24,612][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 61 and human policies 1. 
[2025-11-13 08:04:24,613][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:04:54,100][__main__][INFO] - Number of regex retries in iteration 620: 0 [2025-11-13 08:04:54,101][__main__][INFO] - agents played in iteration 620 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:04:54,943][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:04:54,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:04:54,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:04:55,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.42%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:04:55,012][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:04:55,012][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:04:55,892][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:04:56,357][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:04:56,867][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:04:57,379][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:04:57,883][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:04:58,395][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:04:58,900][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:04:59,405][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:04:59,924][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:05:00,435][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:05:00,938][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:05:01,438][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:05:01,935][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:05:02,441][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:05:02,942][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:05:03,444][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:05:03,949][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:05:04,455][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:05:04,955][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:05:05,457][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:05:05,959][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:05:06,461][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:05:06,963][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:05:07,464][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:05:07,972][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:05:08,477][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:05:08,989][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:05:09,488][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:05:09,988][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:05:10,497][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:05:10,999][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:05:11,505][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:05:12,002][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:05:12,503][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:05:13,010][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:05:13,515][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:05:14,018][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:05:14,516][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:05:15,014][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:05:15,519][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:05:16,019][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:05:16,519][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:05:17,030][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:05:17,533][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:05:18,038][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:05:18,541][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:05:19,038][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:05:19,539][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:05:20,039][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:05:20,538][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:05:21,032][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:05:21,533][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:05:22,035][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:05:22,531][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:05:23,030][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:05:23,527][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:05:24,026][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:05:24,525][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:05:25,031][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:05:25,536][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:05:26,043][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:05:26,552][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:05:27,055][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:05:27,560][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:05:28,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9933 tokens. [2025-11-13 08:05:28,958][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 08:05:29,714][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:05:29,716][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:05:29,718][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:05:31,534][__main__][INFO] - Iteration 621 took 1m 6s (44.06% Gen, 53.22% Train). Generation: 29s, Training: 35s. Estimated remaining time: 45h 50m 17s. Estimated total time: 55h 46m 7s. Time estimates for 10 more iterations: 11m 9s, 100 more iterations: 1h 51m 32s, 500 more iterations: 9h 17m 41s. [2025-11-13 08:05:31,536][__main__][INFO] - Starting iteration 621. [2025-11-13 08:05:32,008][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 62 and human policies 1. 
[2025-11-13 08:05:32,009][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:06:00,618][__main__][INFO] - Number of regex retries in iteration 621: 0 [2025-11-13 08:06:00,620][__main__][INFO] - agents played in iteration 621 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:06:01,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:06:01,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:06:01,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:06:01,541][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:06:01,542][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:06:01,542][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:06:02,392][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:06:02,862][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:06:03,373][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:06:03,879][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:06:04,396][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:06:04,899][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:06:05,412][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:06:05,919][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:06:06,423][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:06:06,938][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:06:07,439][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:06:07,943][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:06:08,447][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:06:08,953][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:06:09,458][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:06:09,961][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:06:10,469][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:06:10,976][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:06:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:06:11,998][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:06:12,506][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:06:13,005][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:06:13,507][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:06:14,011][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:06:14,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:06:15,030][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:06:15,533][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:06:16,041][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:06:16,537][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:06:17,040][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:06:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:06:18,049][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:06:18,556][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:06:19,057][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:06:19,560][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:06:20,064][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:06:20,569][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:06:21,071][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:06:21,573][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:06:22,079][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:06:22,591][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:06:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:06:23,598][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:06:24,102][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:06:24,606][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:06:25,125][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:06:25,632][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:06:26,138][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:06:26,647][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:06:27,153][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:06:27,656][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:06:28,164][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:06:28,665][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:06:29,161][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:06:29,661][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:06:30,164][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:06:30,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:06:31,162][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:06:31,661][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:06:32,166][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:06:32,672][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:06:33,179][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:06:33,685][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:06:34,191][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:06:34,696][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10013 tokens. [2025-11-13 08:06:35,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 08:06:36,242][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:06:36,244][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:06:36,245][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:06:37,059][__main__][INFO] - Iteration 622 took 1m 5s (43.98% Gen, 54.76% Train). Generation: 28s, Training: 35s. Estimated remaining time: 44h 15m 40s. Estimated total time: 54h 12m 35s. Time estimates for 10 more iterations: 10m 50s, 100 more iterations: 1h 48m 25s, 500 more iterations: 9h 2m 5s. [2025-11-13 08:06:37,062][__main__][INFO] - Starting iteration 622. [2025-11-13 08:06:37,543][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 62 and human policies 1. 
[2025-11-13 08:06:37,543][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:07:11,243][__main__][INFO] - Number of regex retries in iteration 622: 0 [2025-11-13 08:07:11,244][__main__][INFO] - agents played in iteration 622 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:07:12,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:12,131][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:12,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:12,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:07:12,192][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:07:12,193][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:07:13,026][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:07:13,488][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:07:13,998][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:07:14,503][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:07:15,034][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:07:15,539][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:07:16,046][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:07:16,552][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:07:17,056][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:07:17,550][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:07:18,048][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:07:18,551][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:07:19,053][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:07:19,553][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:07:20,053][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:07:20,553][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:07:21,053][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:07:21,555][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:07:22,060][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:07:22,563][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:07:23,068][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:07:23,574][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:07:24,079][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:07:24,582][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:07:25,086][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:07:25,591][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:07:26,091][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:07:26,593][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:07:27,112][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:07:27,616][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:07:28,136][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:07:28,639][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:07:29,138][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:07:29,647][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:07:30,139][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:07:30,639][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:07:31,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:07:31,637][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:07:32,153][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:07:32,656][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:07:33,163][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:07:33,664][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:07:34,172][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:07:34,671][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:07:35,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:07:35,683][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:07:36,194][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:07:36,703][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:07:37,203][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:07:37,709][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:07:38,215][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:07:38,742][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:07:39,238][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:07:39,743][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:07:40,249][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:07:40,753][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:07:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:07:41,768][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:07:42,276][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:07:42,784][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:07:43,291][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:07:43,797][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:07:44,304][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:07:44,810][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:07:45,341][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10114 tokens. [2025-11-13 08:07:46,200][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 08:07:46,951][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:07:46,952][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:07:46,954][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:07:47,892][__main__][INFO] - Iteration 623 took 1m 10s (47.90% Gen, 50.76% Train). Generation: 33s, Training: 35s. Estimated remaining time: 48h 39m 24s. Estimated total time: 58h 37m 30s. Time estimates for 10 more iterations: 11m 43s, 100 more iterations: 1h 57m 15s, 500 more iterations: 9h 46m 15s. [2025-11-13 08:07:47,896][__main__][INFO] - Starting iteration 623. [2025-11-13 08:07:48,380][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 62 and human policies 1. 
[2025-11-13 08:07:48,381][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:08:14,772][__main__][INFO] - Number of regex retries in iteration 623: 0 [2025-11-13 08:08:14,773][__main__][INFO] - agents played in iteration 623 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:08:15,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:15,597][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:15,619][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:15,642][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:08:15,643][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:08:15,644][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:08:16,446][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:08:16,914][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:08:17,419][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:08:17,925][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:08:18,425][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:08:18,928][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:08:19,432][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:08:19,938][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:08:20,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:08:20,947][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:08:21,448][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:08:21,962][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:08:22,466][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:08:22,971][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:08:23,474][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:08:23,978][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:08:24,482][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:08:24,982][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:08:25,481][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:08:25,986][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:08:26,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:08:26,990][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:08:27,491][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:08:27,997][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:08:28,500][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:08:29,004][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:08:29,505][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:08:30,005][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:08:30,505][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:08:31,007][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:08:31,511][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:08:32,008][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:08:32,514][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:08:33,014][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:08:33,512][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:08:34,025][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:08:34,525][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:08:35,024][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:08:35,535][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:08:36,037][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:08:36,557][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:08:37,062][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:08:37,568][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:08:38,069][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:08:38,579][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:08:39,090][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:08:39,599][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:08:40,110][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:08:40,618][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:08:41,127][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:08:41,647][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:08:42,148][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:08:42,657][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:08:43,162][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:08:43,663][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:08:44,171][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:08:44,678][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:08:45,182][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:08:45,692][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:08:46,198][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:08:47,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:08:48,383][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:08:48,890][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:08:49,396][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:08:49,902][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10067 tokens. [2025-11-13 08:08:50,860][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.35%, ΔTime: 00:00:34 [2025-11-13 08:08:51,566][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:08:51,568][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:08:51,570][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:08:52,417][__main__][INFO] - Iteration 624 took 1m 4s (41.21% Gen, 57.46% Train). Generation: 26s, Training: 36s. Estimated remaining time: 43h 22m 43s. Estimated total time: 53h 21m 53s. Time estimates for 10 more iterations: 10m 40s, 100 more iterations: 1h 46m 43s, 500 more iterations: 8h 53m 38s. [2025-11-13 08:08:52,421][__main__][INFO] - Starting iteration 624. [2025-11-13 08:08:52,903][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 62 and human policies 1. 
[2025-11-13 08:08:52,904][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:09:28,536][__main__][INFO] - Number of regex retries in iteration 624: 0 [2025-11-13 08:09:28,537][__main__][INFO] - agents played in iteration 624 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:09:29,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:29,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:29,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:29,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:09:29,450][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:09:29,451][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:09:30,316][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:09:30,784][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:09:31,299][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:09:31,807][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:09:32,315][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:09:32,815][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:09:33,318][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:09:33,826][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:09:34,328][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:09:34,821][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:09:35,335][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:09:35,839][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:09:36,344][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:09:36,837][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:09:37,341][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:09:37,848][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:09:38,350][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:09:38,862][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:09:39,364][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:09:39,868][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:09:40,372][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:09:40,879][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:09:41,387][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:09:41,891][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:09:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:09:42,901][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:09:43,399][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:09:43,902][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:09:44,408][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:09:44,906][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:09:45,413][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:09:45,915][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:09:46,420][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:09:46,925][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:09:47,429][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:09:47,930][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:09:48,441][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:09:48,952][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:09:49,469][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:09:49,968][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:09:50,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:09:50,976][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:09:51,475][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:09:51,982][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:09:52,486][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:09:52,992][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:09:53,498][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:09:53,996][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:09:54,500][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:09:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:09:55,506][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:09:56,006][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:09:56,512][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:09:57,021][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:09:57,521][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:09:58,023][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:09:58,527][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:09:59,033][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:09:59,537][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:10:00,042][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:10:00,545][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:10:01,048][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:10:01,550][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:10:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:10:02,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9993 tokens. [2025-11-13 08:10:03,390][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 08:10:04,176][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:10:04,177][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:10:04,179][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:10:05,186][__main__][INFO] - Iteration 625 took 1m 12s (49.30% Gen, 49.31% Train). Generation: 35s, Training: 35s. Estimated remaining time: 50h 13m 46s. Estimated total time: 60h 14m 10s. Time estimates for 10 more iterations: 12m 2s, 100 more iterations: 2h 0m 28s, 500 more iterations: 10h 2m 21s. [2025-11-13 08:10:05,188][__main__][INFO] - Starting iteration 625. [2025-11-13 08:10:05,684][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 62 and human policies 1. 
[2025-11-13 08:10:05,685][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:10:31,669][__main__][INFO] - Number of regex retries in iteration 625: 0 [2025-11-13 08:10:31,670][__main__][INFO] - agents played in iteration 625 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:10:32,499][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:32,530][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:32,552][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:32,574][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:10:32,575][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:10:32,575][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:10:33,360][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:10:33,812][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:10:34,321][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:10:34,823][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:10:35,327][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:10:35,831][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:10:36,335][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:10:36,836][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:10:37,336][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:10:37,836][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:10:38,336][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:10:38,842][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:10:39,342][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:10:39,840][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:10:40,340][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:10:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:10:41,353][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:10:41,853][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:10:42,352][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:10:42,880][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:10:43,399][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:10:43,907][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:10:44,408][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:10:44,914][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:10:45,419][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:10:45,928][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:10:46,433][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:10:46,938][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:10:47,451][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:10:47,958][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:10:48,464][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:10:48,980][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:10:49,492][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:10:50,005][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:10:50,508][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:10:51,014][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:10:51,517][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:10:52,019][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:10:52,525][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:10:53,028][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:10:53,531][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:10:54,039][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:10:54,539][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:10:55,038][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:10:55,538][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:10:56,041][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:10:56,541][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:10:57,047][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:10:57,547][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:10:58,052][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:10:58,556][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:10:59,059][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:10:59,572][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:11:00,074][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:11:00,592][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:11:01,093][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:11:01,594][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:11:02,091][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:11:02,597][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:11:03,097][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:11:03,596][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:11:04,089][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:11:04,599][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:11:05,100][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:11:05,596][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9936 tokens. [2025-11-13 08:11:06,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.92%, Current % of VRAM taken: 58.17%, Block Peak % of device VRAM: 62.43%, ΔTime: 00:00:33 [2025-11-13 08:11:07,179][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:11:07,181][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:11:07,183][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:11:08,140][__main__][INFO] - Iteration 626 took 1m 2s (41.61% Gen, 56.86% Train). Generation: 25s, Training: 35s. Estimated remaining time: 42h 1m 21s. Estimated total time: 52h 2m 48s. Time estimates for 10 more iterations: 10m 24s, 100 more iterations: 1h 44m 5s, 500 more iterations: 8h 40m 28s. [2025-11-13 08:11:08,141][__main__][INFO] - Starting iteration 626. [2025-11-13 08:11:08,629][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 62 and human policies 1. 
[2025-11-13 08:11:08,629][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:11:34,589][__main__][INFO] - Number of regex retries in iteration 626: 0 [2025-11-13 08:11:34,591][__main__][INFO] - agents played in iteration 626 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:11:35,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:35,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:35,563][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:35,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:11:35,589][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:11:35,590][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:11:36,454][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:11:36,919][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:11:37,433][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:11:37,941][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:11:38,439][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:11:38,934][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:11:39,432][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:11:39,934][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:11:40,434][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:11:40,936][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:11:41,444][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:11:41,945][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:11:42,450][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:11:42,952][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:11:43,456][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:11:43,971][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:11:44,472][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:11:44,995][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:11:45,494][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:11:46,004][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:11:46,523][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:11:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:11:47,552][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:11:48,066][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:11:48,583][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:11:49,097][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:11:49,596][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:11:50,098][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:11:50,608][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:11:51,115][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:11:51,622][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:11:52,126][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:11:52,628][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:11:53,133][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:11:53,638][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:11:54,140][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:11:54,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:11:55,141][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:11:55,649][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:11:56,149][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:11:56,653][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:11:57,170][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:11:57,678][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:11:58,193][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:11:58,697][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:11:59,202][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:11:59,713][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:12:00,218][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:12:00,725][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:12:01,228][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:12:01,735][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:12:02,243][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:12:02,751][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:12:03,256][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:12:03,774][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:12:04,280][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:12:04,794][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:12:05,305][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:12:05,812][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:12:06,320][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:12:06,828][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:12:07,335][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:12:07,840][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:12:08,342][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:12:08,846][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10020 tokens. [2025-11-13 08:12:09,740][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 08:12:10,396][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:12:10,398][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:12:10,400][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:12:11,224][__main__][INFO] - Iteration 627 took 1m 2s (41.47% Gen, 57.21% Train). Generation: 25s, Training: 35s. Estimated remaining time: 42h 7m 18s. Estimated total time: 52h 9m 48s. Time estimates for 10 more iterations: 10m 25s, 100 more iterations: 1h 44m 19s, 500 more iterations: 8h 41m 38s. [2025-11-13 08:12:11,226][__main__][INFO] - Starting iteration 627. [2025-11-13 08:12:11,718][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 62 and human policies 1. 
[2025-11-13 08:12:11,719][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:12:36,608][__main__][INFO] - Number of regex retries in iteration 627: 0 [2025-11-13 08:12:36,609][__main__][INFO] - agents played in iteration 627 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:12:37,544][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:37,573][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:37,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:37,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:12:37,627][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:12:37,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:12:38,447][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:12:38,904][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:12:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:12:39,912][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:12:40,420][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:12:40,927][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:12:41,428][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:12:41,931][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:12:42,431][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:12:42,932][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:12:43,434][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:12:43,932][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:12:44,434][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:12:44,937][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:12:45,438][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:12:45,945][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:12:46,445][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:12:46,945][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:12:47,443][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:12:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:12:48,444][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:12:48,942][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:12:49,441][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:12:49,940][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:12:50,442][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:12:50,941][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:12:51,439][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:12:51,937][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:12:52,442][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:12:52,940][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:12:53,442][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:12:53,940][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:12:54,439][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:12:54,939][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:12:55,438][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:12:55,937][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:12:56,439][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:12:56,935][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:12:57,432][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:12:57,938][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:12:58,447][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:12:58,950][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:12:59,459][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:12:59,966][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:13:00,474][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:13:00,979][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:13:01,494][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:13:01,993][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:13:02,496][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:13:03,003][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:13:03,509][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:13:04,020][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:13:04,528][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:13:05,038][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:13:05,553][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:13:06,062][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:13:06,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:13:07,072][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:13:07,578][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:13:08,099][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:13:08,608][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:13:09,118][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:13:09,621][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:13:10,131][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:13:10,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10089 tokens. [2025-11-13 08:13:11,560][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.36%, ΔTime: 00:00:33 [2025-11-13 08:13:12,315][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:13:12,317][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:13:12,320][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:13:13,307][__main__][INFO] - Iteration 628 took 1m 1s (40.41% Gen, 57.98% Train). Generation: 24s, Training: 35s. Estimated remaining time: 41h 15m 58s. Estimated total time: 51h 19m 30s. Time estimates for 10 more iterations: 10m 15s, 100 more iterations: 1h 42m 39s, 500 more iterations: 8h 33m 15s. [2025-11-13 08:13:13,310][__main__][INFO] - Starting iteration 628. [2025-11-13 08:13:13,831][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 62 and human policies 1. 
[2025-11-13 08:13:13,831][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:13:37,063][__main__][INFO] - Number of regex retries in iteration 628: 0 [2025-11-13 08:13:37,064][__main__][INFO] - agents played in iteration 628 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:13:38,013][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:38,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:38,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:38,087][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:13:38,088][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:13:38,089][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:13:38,969][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:13:39,430][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:13:39,940][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:13:40,445][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:13:40,954][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:13:41,454][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:13:41,956][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:13:42,458][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:13:42,962][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:13:43,477][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:13:43,971][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:13:44,467][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:13:44,974][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:13:45,473][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:13:45,977][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:13:46,473][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:13:46,972][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:13:47,476][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:13:47,981][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:13:48,484][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:13:48,989][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:13:49,491][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:13:50,010][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:13:50,513][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:13:51,016][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:13:51,519][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:13:52,018][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:13:52,519][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:13:53,019][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:13:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:13:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:13:54,524][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:13:55,023][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:13:55,526][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:13:56,030][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:13:56,533][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:13:57,035][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:13:57,543][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:13:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:13:58,551][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:13:59,051][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:13:59,554][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:14:00,058][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:14:00,561][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:14:01,064][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:14:01,568][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:14:02,071][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:14:02,574][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:14:03,091][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:14:03,632][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:14:04,186][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:14:04,689][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:14:05,194][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:14:05,704][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:14:06,210][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:14:06,731][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:14:07,237][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:14:07,743][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:14:08,243][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:14:08,749][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:14:09,255][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:14:09,762][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:14:10,266][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:14:10,775][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:14:11,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10027 tokens. [2025-11-13 08:14:12,202][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.16%, Current % of VRAM taken: 58.41%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 08:14:12,854][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:14:12,855][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:14:12,858][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:14:13,749][__main__][INFO] - Iteration 629 took 59s (38.77% Gen, 59.74% Train). Generation: 23s, Training: 35s. Estimated remaining time: 39h 51m 25s. Estimated total time: 49h 55m 57s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 51s, 500 more iterations: 8h 19m 19s. [2025-11-13 08:14:13,751][__main__][INFO] - Starting iteration 629. [2025-11-13 08:14:14,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 62 and human policies 1. 
[2025-11-13 08:14:14,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:14:33,944][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 15 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:14:43,451][__main__][INFO] - Number of regex retries in iteration 629: 1 [2025-11-13 08:14:43,452][__main__][INFO] - agents played in iteration 629 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:14:44,316][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:44,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:44,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:44,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:14:44,396][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:14:44,397][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:14:45,225][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:14:45,680][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:14:46,199][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:14:46,697][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:14:47,190][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:14:47,705][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:14:48,203][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:14:48,699][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:14:49,195][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:14:49,691][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:14:50,207][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:14:50,708][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:14:51,215][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:14:51,720][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:14:52,220][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:14:52,723][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:14:53,227][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:14:53,723][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:14:54,229][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:14:54,730][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:14:55,233][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:14:55,731][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:14:56,225][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:14:56,720][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:14:57,213][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:14:57,713][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:14:58,224][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:14:58,725][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:14:59,226][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:14:59,728][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:15:00,254][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:15:00,758][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:15:01,261][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:15:01,758][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:15:02,257][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:15:02,757][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:15:03,255][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:15:03,752][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:15:04,250][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:15:04,749][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:15:05,251][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:15:05,747][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:15:06,242][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:15:06,737][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:15:07,242][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:15:07,734][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:15:08,230][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:15:08,739][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:15:09,244][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:15:09,746][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:15:10,252][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:15:10,755][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:15:11,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:15:11,769][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:15:12,275][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:15:12,781][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:15:13,287][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:15:13,793][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:15:14,297][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:15:14,799][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:15:15,316][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:15:15,823][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:15:16,322][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:15:16,826][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:15:17,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9862 tokens. [2025-11-13 08:15:18,223][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 08:15:19,004][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:15:19,005][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:15:19,008][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:15:19,938][__main__][INFO] - Iteration 630 took 1m 5s (44.43% Gen, 54.15% Train). Generation: 29s, Training: 35s. Estimated remaining time: 44h 37m 12s. Estimated total time: 54h 42m 50s. Time estimates for 10 more iterations: 10m 56s, 100 more iterations: 1h 49m 25s, 500 more iterations: 9h 7m 8s. [2025-11-13 08:15:19,940][__main__][INFO] - Starting iteration 630. [2025-11-13 08:15:20,429][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 62 and human policies 1. 
[2025-11-13 08:15:20,430][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:15:54,564][__main__][INFO] - Number of regex retries in iteration 630: 0 [2025-11-13 08:15:54,565][__main__][INFO] - agents played in iteration 630 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:15:55,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:55,433][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:55,461][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:55,485][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:15:55,486][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:15:55,486][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:15:56,259][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:15:56,715][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:15:57,212][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:15:57,717][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:15:58,211][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:15:58,730][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:15:59,233][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:15:59,737][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:16:00,235][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:16:00,733][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:16:01,243][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:16:01,738][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:16:02,238][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:16:02,740][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:16:03,234][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:16:03,736][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:16:04,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:16:04,733][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:16:05,238][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:16:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:16:06,238][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:16:06,743][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:16:07,240][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:16:07,737][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:16:08,237][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:16:08,745][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:16:09,244][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:16:09,747][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:16:10,255][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:16:10,759][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:16:11,261][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:16:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:16:12,269][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:16:12,775][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:16:13,282][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:16:13,788][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:16:14,294][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:16:14,802][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:16:15,309][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:16:15,836][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:16:16,347][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:16:16,855][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:16:17,356][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:16:17,863][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:16:18,364][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:16:18,863][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:16:19,366][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:16:19,876][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:16:20,377][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:16:20,874][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:16:21,382][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:16:21,890][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:16:22,396][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:16:22,892][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:16:23,398][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:16:23,910][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:16:24,412][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:16:24,936][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:16:25,436][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:16:25,939][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:16:26,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:16:26,939][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:16:27,440][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:16:27,942][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:16:28,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9939 tokens. [2025-11-13 08:16:30,007][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 08:16:31,208][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:16:31,210][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:16:31,212][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:16:32,971][__main__][INFO] - Iteration 631 took 1m 12s (47.06% Gen, 50.52% Train). Generation: 34s, Training: 36s. Estimated remaining time: 50h 20m 14s. Estimated total time: 60h 27m 5s. Time estimates for 10 more iterations: 12m 5s, 100 more iterations: 2h 0m 54s, 500 more iterations: 10h 4m 30s. [2025-11-13 08:16:32,973][__main__][INFO] - Starting iteration 631. [2025-11-13 08:16:33,465][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 63 and human policies 1. 
[2025-11-13 08:16:33,467][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:16:58,423][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:17:08,355][__main__][INFO] - Number of regex retries in iteration 631: 1 [2025-11-13 08:17:08,356][__main__][INFO] - agents played in iteration 631 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:17:09,224][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:09,249][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:09,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:09,298][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:17:09,299][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:17:09,299][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:17:10,149][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:17:10,630][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:17:11,131][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:17:11,638][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:17:12,141][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:17:12,653][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:17:13,156][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:17:13,658][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:17:14,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:17:14,667][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:17:15,173][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:17:15,676][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:17:16,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:17:16,685][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:17:17,191][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:17:17,696][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:17:18,199][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:17:18,717][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:17:19,220][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:17:19,736][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:17:20,239][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:17:20,748][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:17:21,250][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:17:21,764][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:17:22,271][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:17:22,775][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:17:23,281][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:17:23,786][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:17:24,290][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:17:24,791][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:17:25,296][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:17:25,793][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:17:26,292][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:17:26,798][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:17:27,300][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:17:27,814][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:17:28,320][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:17:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:17:29,339][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:17:29,842][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:17:30,354][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:17:30,863][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:17:31,368][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:17:31,873][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:17:32,381][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:17:32,886][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:17:33,393][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:17:33,900][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:17:34,416][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:17:34,913][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:17:35,417][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:17:35,915][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:17:36,414][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:17:36,914][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:17:37,411][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:17:37,909][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:17:38,420][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:17:38,925][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:17:39,447][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:17:39,953][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:17:40,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:17:40,960][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:17:41,466][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:17:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:17:42,470][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9751 tokens. [2025-11-13 08:17:43,291][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.97%, Current % of VRAM taken: 58.22%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 08:17:44,057][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:17:44,059][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:17:44,061][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:17:45,033][__main__][INFO] - Iteration 632 took 1m 11s (48.75% Gen, 49.89% Train). Generation: 34s, Training: 35s. Estimated remaining time: 49h 30m 21s. Estimated total time: 59h 38m 25s. Time estimates for 10 more iterations: 11m 55s, 100 more iterations: 1h 59m 16s, 500 more iterations: 9h 56m 24s. [2025-11-13 08:17:45,035][__main__][INFO] - Starting iteration 632. [2025-11-13 08:17:45,510][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 63 and human policies 1. 
[2025-11-13 08:17:45,511][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:18:05,819][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls (corrected: 0 hats, 0 books, 10 balls) did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:18:07,988][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:18:10,766][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls (incorrect response, let me correct this) Given the high value for books and balls, I should propose to take as many of these items as possible. Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 2/3 [2025-11-13 08:18:12,540][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls Wait, let's reconsider. Given our values, we should each try to maximize our points by taking items we value highly while ensuring we don't overpropose relative to the items available. 
Let's propose: Proposal: 0 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:18:12,910][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls This proposal might seem counter-intuitive, but given that both鲍勃和我对于帽子和书的价值评估较高,而对球的价值评估较低,我们可以假设对方可能会采取类似的策略来最大化自己的得分。通过这样分配,我们可以确保在两者都更重视的物品上不进行竞争,从而可能获得更高的总分。当然,实际策略可能需要根据具体的游戏情况和对对手行为的进一步了解进行调整。 did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:18:14,169][__main__][INFO] - Number of regex retries in iteration 632: 5 [2025-11-13 08:18:14,170][__main__][INFO] - agents played in iteration 632 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:18:15,041][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:15,067][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:15,093][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:15,116][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:18:15,116][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:18:15,118][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:18:15,919][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:18:16,379][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:18:16,881][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:18:17,379][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:18:17,881][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:18:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:18:18,886][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:18:19,387][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:18:19,887][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:18:20,396][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:18:20,902][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:18:21,410][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:18:21,919][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:18:22,425][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:18:22,944][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:18:23,451][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:18:23,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:18:24,490][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:18:25,001][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:18:25,520][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:18:26,023][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:18:26,532][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:18:27,047][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:18:27,552][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:18:28,068][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:18:28,573][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:18:29,081][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:18:29,588][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:18:30,097][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:18:30,600][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:18:31,105][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:18:31,610][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:18:32,120][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:18:32,622][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:18:33,125][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:18:33,644][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:18:34,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:18:34,659][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:18:35,163][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:18:35,663][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:18:36,172][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:18:36,671][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:18:37,169][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:18:37,672][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:18:38,173][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:18:38,674][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:18:39,179][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:18:39,681][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:18:40,182][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:18:40,687][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:18:41,189][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:18:41,689][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:18:43,446][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:18:44,017][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:18:44,530][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:18:45,037][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:18:45,548][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:18:46,051][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:18:46,558][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:18:47,056][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:18:47,559][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:18:48,058][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:18:48,554][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:18:49,057][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:18:49,554][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9883 tokens. [2025-11-13 08:18:50,386][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.94%, Current % of VRAM taken: 58.18%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:34 [2025-11-13 08:18:51,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:18:51,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:18:51,066][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:18:51,897][__main__][INFO] - Iteration 633 took 1m 6s (43.17% Gen, 55.58% Train). Generation: 28s, Training: 36s. Estimated remaining time: 45h 10m 12s. Estimated total time: 55h 19m 22s. Time estimates for 10 more iterations: 11m 3s, 100 more iterations: 1h 50m 38s, 500 more iterations: 9h 13m 13s. [2025-11-13 08:18:51,900][__main__][INFO] - Starting iteration 633. [2025-11-13 08:18:52,380][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 63 and human policies 1. 
[2025-11-13 08:18:52,380][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:19:25,432][__main__][INFO] - Number of regex retries in iteration 633: 0 [2025-11-13 08:19:25,433][__main__][INFO] - agents played in iteration 633 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:19:26,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:26,300][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:26,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:26,345][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:19:26,346][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:19:26,346][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:19:27,212][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:19:27,680][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:19:28,192][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:19:28,701][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:19:29,211][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:19:29,715][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:19:30,229][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:19:30,742][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:19:31,248][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:19:31,756][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:19:32,259][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:19:32,778][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:19:33,289][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:19:33,795][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:19:34,297][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:19:34,798][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:19:35,303][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:19:35,808][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:19:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:19:36,824][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:19:37,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:19:37,837][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:19:38,341][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:19:38,849][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:19:39,359][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:19:39,858][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:19:40,359][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:19:40,867][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:19:41,369][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:19:41,883][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:19:42,381][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:19:42,886][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:19:43,396][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:19:43,900][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:19:44,405][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:19:44,911][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:19:45,421][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:19:45,928][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:19:46,438][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:19:46,947][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:19:47,458][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:19:47,962][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:19:48,467][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:19:48,972][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:19:49,478][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:19:49,983][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:19:50,489][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:19:51,000][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:19:51,512][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:19:52,017][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:19:52,523][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:19:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:19:53,523][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:19:54,027][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:19:54,525][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:19:55,027][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:19:55,531][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:19:56,035][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:19:56,539][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:19:57,041][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:19:57,556][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:19:58,058][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:19:58,583][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:19:59,080][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:19:59,581][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10003 tokens. [2025-11-13 08:20:00,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.17%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 08:20:01,197][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:20:01,198][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:20:01,200][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:20:02,130][__main__][INFO] - Iteration 634 took 1m 9s (47.38% Gen, 51.28% Train). Generation: 33s, Training: 35s. Estimated remaining time: 47h 57m 14s. Estimated total time: 58h 7m 34s. Time estimates for 10 more iterations: 11m 37s, 100 more iterations: 1h 56m 15s, 500 more iterations: 9h 41m 15s. [2025-11-13 08:20:02,132][__main__][INFO] - Starting iteration 634. [2025-11-13 08:20:02,615][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 63 and human policies 1. 
[2025-11-13 08:20:02,615][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:20:21,161][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 20 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:20:31,474][__main__][INFO] - Number of regex retries in iteration 634: 1 [2025-11-13 08:20:31,474][__main__][INFO] - agents played in iteration 634 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:20:32,318][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:32,341][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:32,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:32,386][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:20:32,387][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:20:32,388][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:20:33,280][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:20:33,742][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:20:34,267][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:20:34,773][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:20:35,289][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:20:35,800][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:20:36,301][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:20:36,808][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:20:37,310][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:20:37,810][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:20:38,311][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:20:38,816][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:20:39,321][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:20:39,823][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:20:40,334][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:20:40,842][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:20:41,347][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:20:41,851][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:20:42,351][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:20:42,852][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:20:43,357][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:20:43,867][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:20:44,367][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:20:44,878][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:20:45,390][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:20:45,891][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:20:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:20:46,898][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:20:47,404][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:20:47,901][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:20:48,400][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:20:48,896][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:20:49,404][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:20:49,909][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:20:50,406][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:20:50,903][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:20:51,403][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:20:51,905][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:20:52,414][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:20:52,911][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:20:53,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:20:53,918][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:20:54,422][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:20:54,924][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:20:55,422][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:20:55,923][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:20:56,425][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:20:56,921][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:20:57,424][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:20:57,933][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:20:58,437][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:20:58,940][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:20:59,442][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:20:59,942][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:21:00,446][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:21:00,940][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:21:01,440][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:21:01,947][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:21:02,445][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:21:02,941][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:21:03,444][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:21:03,943][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:21:04,450][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:21:04,953][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:21:05,450][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9692 tokens. [2025-11-13 08:21:06,343][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.92%, Current % of VRAM taken: 58.17%, Block Peak % of device VRAM: 62.14%, ΔTime: 00:00:33 [2025-11-13 08:21:07,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:21:07,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:21:07,140][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:21:08,057][__main__][INFO] - Iteration 635 took 1m 5s (44.10% Gen, 54.50% Train). Generation: 28s, Training: 35s. Estimated remaining time: 44h 20m 42s. Estimated total time: 54h 32m 9s. Time estimates for 10 more iterations: 10m 54s, 100 more iterations: 1h 49m 4s, 500 more iterations: 9h 5m 21s. [2025-11-13 08:21:08,059][__main__][INFO] - Starting iteration 635. [2025-11-13 08:21:08,545][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 63 and human policies 1. 
[2025-11-13 08:21:08,546][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:21:28,003][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:21:37,064][__main__][INFO] - Number of regex retries in iteration 635: 1 [2025-11-13 08:21:37,064][__main__][INFO] - agents played in iteration 635 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:21:37,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:37,883][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:37,905][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:37,928][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:21:37,928][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:21:37,929][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:21:38,778][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:21:39,231][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:21:39,739][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:21:40,254][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:21:41,962][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:21:42,529][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:21:43,037][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:21:43,539][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:21:44,039][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:21:44,548][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:21:45,047][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:21:45,547][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:21:46,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:21:46,550][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:21:47,048][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:21:47,551][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:21:48,054][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:21:48,559][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:21:49,061][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:21:49,561][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:21:50,058][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:21:50,562][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:21:51,069][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:21:51,573][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:21:52,072][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:21:52,577][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:21:53,098][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:21:53,601][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:21:54,100][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:21:54,604][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:21:55,109][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:21:55,615][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:21:56,115][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:21:56,622][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:21:57,128][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:21:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:21:58,141][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:21:58,643][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:21:59,152][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:21:59,663][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:22:00,168][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:22:00,672][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:22:01,174][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:22:01,678][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:22:02,209][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:22:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:22:03,220][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:22:03,726][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:22:04,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:22:04,735][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:22:05,242][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:22:05,746][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:22:06,253][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:22:06,755][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:22:07,257][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:22:07,758][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:22:08,264][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:22:08,776][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:22:09,281][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:22:09,779][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:22:10,288][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:22:10,790][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:22:11,303][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:22:11,807][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:22:12,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9867 tokens. [2025-11-13 08:22:13,149][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.01%, Current % of VRAM taken: 58.26%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:34 [2025-11-13 08:22:13,868][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:22:13,870][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:22:13,872][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:22:14,874][__main__][INFO] - Iteration 636 took 1m 6s (42.99% Gen, 55.49% Train). Generation: 28s, Training: 36s. Estimated remaining time: 45h 3m 54s. Estimated total time: 55h 16m 27s. Time estimates for 10 more iterations: 11m 3s, 100 more iterations: 1h 50m 32s, 500 more iterations: 9h 12m 44s. [2025-11-13 08:22:14,876][__main__][INFO] - Starting iteration 636. [2025-11-13 08:22:15,371][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 63 and human policies 1. 
[2025-11-13 08:22:15,372][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:22:47,146][__main__][INFO] - Number of regex retries in iteration 636: 0 [2025-11-13 08:22:47,147][__main__][INFO] - agents played in iteration 636 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:22:47,988][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:48,014][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:48,039][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:48,065][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:22:48,066][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:22:48,067][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:22:49,005][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:22:49,463][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:22:49,974][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:22:50,481][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:22:50,996][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:22:51,496][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:22:51,995][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:22:52,506][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:22:53,012][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:22:53,518][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:22:54,027][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:22:54,534][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:22:55,039][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:22:55,539][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:22:56,044][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:22:56,542][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:22:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:22:57,553][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:22:58,054][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:22:58,561][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:22:59,063][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:22:59,563][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:23:00,065][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:23:00,560][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:23:01,058][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:23:01,566][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:23:02,072][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:23:02,581][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:23:03,080][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:23:03,586][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:23:04,104][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:23:04,604][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:23:05,109][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:23:05,618][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:23:06,121][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:23:06,636][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:23:07,139][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:23:07,639][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:23:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:23:08,633][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:23:09,143][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:23:09,649][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:23:10,153][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:23:10,662][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:23:11,165][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:23:11,669][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:23:12,177][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:23:12,677][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:23:13,174][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:23:13,675][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:23:14,176][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:23:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:23:15,183][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:23:15,681][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:23:16,181][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:23:16,683][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:23:17,187][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:23:17,688][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:23:18,189][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:23:18,690][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:23:19,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:23:19,680][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:23:20,179][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:23:20,678][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:23:21,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9783 tokens. [2025-11-13 08:23:22,028][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.08%, ΔTime: 00:00:33 [2025-11-13 08:23:22,797][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:23:22,798][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:23:22,800][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:23:23,719][__main__][INFO] - Iteration 637 took 1m 8s (46.49% Gen, 52.16% Train). Generation: 31s, Training: 35s. Estimated remaining time: 46h 43m 42s. Estimated total time: 56h 57m 24s. Time estimates for 10 more iterations: 11m 23s, 100 more iterations: 1h 53m 54s, 500 more iterations: 9h 29m 34s. [2025-11-13 08:23:23,721][__main__][INFO] - Starting iteration 637. [2025-11-13 08:23:24,190][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 63 and human policies 1. 
[2025-11-13 08:23:24,192][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:23:52,557][__main__][INFO] - Number of regex retries in iteration 637: 0 [2025-11-13 08:23:52,558][__main__][INFO] - agents played in iteration 637 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:23:53,411][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:53,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:53,465][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:53,489][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:23:53,489][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:23:53,491][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:23:54,270][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:23:54,736][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:23:55,241][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:23:55,744][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:23:56,245][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:23:56,746][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:23:57,252][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:23:57,759][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:23:58,263][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:23:58,773][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:23:59,281][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:23:59,797][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:24:00,308][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:24:00,840][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:24:01,344][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:24:01,850][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:24:02,360][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:24:02,865][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:24:03,374][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:24:03,885][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:24:04,386][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:24:04,889][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:24:05,404][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:24:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:24:06,427][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:24:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:24:07,446][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:24:07,944][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:24:08,449][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:24:08,958][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:24:09,466][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:24:09,975][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:24:10,475][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:24:10,972][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:24:11,466][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:24:11,969][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:24:12,472][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:24:12,973][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:24:13,476][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:24:13,969][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:24:14,472][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:24:14,969][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:24:15,474][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:24:15,971][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:24:16,475][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:24:16,981][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:24:17,477][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:24:17,985][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:24:18,501][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:24:19,001][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:24:19,510][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:24:20,005][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:24:20,500][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:24:21,003][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:24:21,503][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:24:22,004][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:24:22,509][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:24:23,009][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:24:23,523][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:24:24,022][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:24:24,522][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:24:25,023][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:24:25,527][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:24:26,042][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:24:26,539][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9869 tokens. [2025-11-13 08:24:27,357][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.04%, Current % of VRAM taken: 58.29%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33 [2025-11-13 08:24:28,117][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:24:28,119][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:24:28,121][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:24:29,164][__main__][INFO] - Iteration 638 took 1m 4s (43.66% Gen, 54.73% Train). Generation: 28s, Training: 35s. Estimated remaining time: 43h 53m 57s. Estimated total time: 54h 8m 45s. Time estimates for 10 more iterations: 10m 49s, 100 more iterations: 1h 48m 17s, 500 more iterations: 9h 1m 27s. [2025-11-13 08:24:29,166][__main__][INFO] - Starting iteration 638. [2025-11-13 08:24:29,675][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 63 and human policies 1. [2025-11-13 08:24:29,676][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:24:54,132][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls + propose taking all hats, all books, and all balls to maximize points based on my values. 
Given the values, I will propose: 10 hats, 10 books, 10 balls. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:25:00,343][__main__][INFO] - Number of regex retries in iteration 638: 1 [2025-11-13 08:25:00,344][__main__][INFO] - agents played in iteration 638 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:25:01,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:01,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:01,326][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:01,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:25:01,366][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:25:01,367][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:25:02,263][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:25:02,728][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:25:03,233][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:25:03,745][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:25:04,253][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:25:04,756][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:25:05,261][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:25:05,762][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:25:06,267][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:25:06,772][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:25:07,280][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:25:07,785][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:25:08,289][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:25:08,792][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:25:09,291][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:25:09,800][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:25:10,300][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:25:10,830][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:25:11,334][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:25:11,841][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:25:12,343][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:25:12,851][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:25:13,357][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:25:13,859][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:25:14,364][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:25:14,882][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:25:15,407][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:25:15,912][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:25:16,415][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:25:16,922][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:25:17,439][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:25:17,947][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:25:18,453][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:25:18,952][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:25:19,454][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:25:19,961][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:25:20,460][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:25:20,970][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:25:21,473][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:25:21,974][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:25:22,481][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:25:22,986][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:25:23,487][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:25:23,994][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:25:24,495][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:25:24,997][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:25:25,497][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:25:25,999][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:25:26,502][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:25:27,009][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:25:27,509][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:25:28,023][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:25:28,528][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:25:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:25:29,544][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:25:30,050][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:25:30,552][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:25:31,057][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:25:31,559][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:25:32,054][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:25:32,558][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:25:33,060][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:25:33,566][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:25:34,075][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:25:34,573][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9881 tokens. [2025-11-13 08:25:35,374][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.93%, Current % of VRAM taken: 58.17%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 08:25:36,130][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:25:36,132][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:25:36,133][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:25:37,142][__main__][INFO] - Iteration 639 took 1m 7s (45.46% Gen, 53.05% Train). Generation: 30s, Training: 35s. Estimated remaining time: 45h 57m 24s. Estimated total time: 56h 13m 20s. Time estimates for 10 more iterations: 11m 14s, 100 more iterations: 1h 52m 26s, 500 more iterations: 9h 22m 13s. [2025-11-13 08:25:37,144][__main__][INFO] - Starting iteration 639. [2025-11-13 08:25:37,615][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 63 and human policies 1. 
[2025-11-13 08:25:37,617][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:26:04,657][__main__][INFO] - Number of regex retries in iteration 639: 0 [2025-11-13 08:26:04,658][__main__][INFO] - agents played in iteration 639 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:26:05,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:05,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:05,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:05,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:26:05,583][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:26:05,584][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:26:06,384][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:26:06,844][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:26:07,362][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:26:07,873][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:26:08,379][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:26:08,884][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:26:09,394][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:26:09,897][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:26:10,404][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:26:10,919][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:26:11,428][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:26:11,934][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:26:12,443][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:26:12,952][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:26:13,466][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:26:13,972][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:26:14,478][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:26:14,987][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:26:15,490][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:26:15,997][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:26:16,500][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:26:17,003][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:26:17,513][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:26:18,020][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:26:18,527][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:26:19,040][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:26:19,546][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:26:20,070][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:26:20,574][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:26:21,077][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:26:21,585][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:26:22,089][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:26:22,594][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:26:23,100][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:26:23,605][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:26:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:26:24,615][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:26:25,120][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:26:25,623][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:26:26,125][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:26:26,623][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:26:27,123][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:26:27,618][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:26:28,115][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:26:28,620][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:26:29,134][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:26:29,634][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:26:30,134][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:26:30,653][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:26:31,154][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:26:31,655][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:26:32,157][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:26:32,658][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:26:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:26:33,670][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:26:34,173][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:26:34,674][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:26:35,170][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:26:35,676][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:26:36,177][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:26:36,677][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:26:37,178][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:26:37,683][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:26:38,186][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:26:38,680][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9998 tokens. [2025-11-13 08:26:39,490][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 12.89%, Current % of VRAM taken: 57.14%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33 [2025-11-13 08:26:40,247][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:26:40,249][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:26:40,250][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:26:41,241][__main__][INFO] - Iteration 640 took 1m 3s (42.50% Gen, 55.94% Train). Generation: 27s, Training: 35s. Estimated remaining time: 42h 44m 19s. Estimated total time: 53h 1m 19s. Time estimates for 10 more iterations: 10m 36s, 100 more iterations: 1h 46m 2s, 500 more iterations: 8h 50m 13s. [2025-11-13 08:26:41,243][__main__][INFO] - Starting iteration 640. [2025-11-13 08:26:41,745][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 63 and human policies 1. 
[2025-11-13 08:26:41,746][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:27:05,373][__main__][INFO] - Number of regex retries in iteration 640: 0
[2025-11-13 08:27:05,374][__main__][INFO] - agents played in iteration 640 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 08:27:06,251][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,292][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.29%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:27:06,349][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:27:06,349][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:27:07,166][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:27:07,629][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:27:08,142][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:27:08,644][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:27:09,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:27:09,651][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:27:10,168][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:27:10,673][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:27:11,175][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:27:11,678][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:27:12,186][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:27:12,696][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:27:13,202][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:27:13,713][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:27:14,220][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:27:14,728][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:27:15,231][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:27:15,732][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:27:16,236][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:27:16,738][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:27:17,242][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:27:17,754][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:27:18,262][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:27:18,767][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:27:19,279][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:27:19,784][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:27:20,288][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:27:20,796][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:27:21,300][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:27:21,806][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:27:22,312][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:27:22,815][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:27:23,321][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:27:23,825][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:27:24,328][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:27:24,839][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:27:25,346][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:27:25,873][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:27:26,377][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:27:26,891][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:27:27,400][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:27:27,906][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:27:28,413][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:27:28,918][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:27:29,422][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:27:29,932][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:27:30,440][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:27:30,951][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:27:31,456][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:27:31,961][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:27:32,466][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:27:32,973][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:27:33,480][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:27:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:27:34,491][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:27:35,002][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:27:35,510][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:27:36,015][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:27:36,518][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:27:37,021][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:27:37,525][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:27:38,032][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:27:38,537][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:27:39,053][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:27:39,555][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10041 tokens. [2025-11-13 08:27:40,400][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:33 [2025-11-13 08:27:41,030][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:27:41,032][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:27:41,033][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:27:42,717][__main__][INFO] - Iteration 641 took 1m 0s (38.75% Gen, 58.48% Train). Generation: 23s, Training: 35s. Estimated remaining time: 40h 30m 36s. Estimated total time: 50h 48m 38s. Time estimates for 10 more iterations: 10m 9s, 100 more iterations: 1h 41m 37s, 500 more iterations: 8h 28m 6s. [2025-11-13 08:27:42,719][__main__][INFO] - Starting iteration 641. [2025-11-13 08:27:43,200][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 64 and human policies 1. 
[2025-11-13 08:27:43,201][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:28:13,482][__main__][INFO] - Number of regex retries in iteration 641: 0
[2025-11-13 08:28:13,482][__main__][INFO] - agents played in iteration 641 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 08:28:14,262][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:14,285][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:14,310][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:14,332][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:28:14,333][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:28:14,334][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:28:15,129][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:28:15,601][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:28:16,106][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:28:16,606][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:28:17,112][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:28:17,619][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:28:18,125][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:28:18,621][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:28:19,124][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:28:19,626][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:28:20,125][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:28:20,630][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:28:21,134][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:28:21,636][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:28:22,140][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:28:22,633][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:28:23,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:28:23,634][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:28:24,134][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:28:24,634][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:28:25,136][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:28:25,642][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:28:26,149][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:28:26,653][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:28:27,156][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:28:27,673][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:28:28,173][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:28:28,693][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:28:29,198][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:28:29,703][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:28:30,216][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:28:30,720][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:28:31,228][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:28:31,737][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:28:32,238][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:28:32,744][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:28:33,249][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:28:33,754][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:28:34,258][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:28:34,762][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:28:35,279][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:28:35,781][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:28:36,285][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:28:36,793][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:28:37,301][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:28:37,814][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:28:38,320][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:28:38,825][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:28:39,334][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:28:39,833][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:28:40,337][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:28:40,839][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:28:41,343][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:28:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:28:42,345][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:28:42,846][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:28:43,352][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:28:43,853][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:28:44,361][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:28:44,868][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:28:45,374][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:28:45,874][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:28:46,381][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:28:46,897][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:28:47,399][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9996 tokens. [2025-11-13 08:28:48,240][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.13%, ΔTime: 00:00:33 [2025-11-13 08:28:48,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:28:48,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:28:48,968][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:28:49,992][__main__][INFO] - Iteration 642 took 1m 6s (45.34% Gen, 53.13% Train). Generation: 30s, Training: 35s. Estimated remaining time: 45h 20m 29s. Estimated total time: 55h 39m 38s. Time estimates for 10 more iterations: 11m 7s, 100 more iterations: 1h 51m 19s, 500 more iterations: 9h 16m 36s. [2025-11-13 08:28:49,994][__main__][INFO] - Starting iteration 642. [2025-11-13 08:28:50,480][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 64 and human policies 1. 
[2025-11-13 08:28:50,480][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 08:29:03,986][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3
[2025-11-13 08:29:14,921][__main__][INFO] - Number of regex retries in iteration 642: 1
[2025-11-13 08:29:14,921][__main__][INFO] - agents played in iteration 642 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 08:29:15,751][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:15,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:15,805][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:15,828][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 08:29:15,828][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 08:29:15,829][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 08:29:16,619][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:29:17,094][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:29:17,600][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:29:18,102][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:29:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:29:19,113][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:29:19,616][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:29:20,116][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:29:20,620][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:29:21,128][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:29:21,629][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:29:22,132][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:29:22,633][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:29:23,137][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:29:23,639][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:29:24,142][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:29:24,645][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:29:25,148][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:29:25,653][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:29:26,157][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:29:26,659][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:29:27,162][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:29:27,667][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:29:28,169][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:29:28,674][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:29:29,177][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:29:29,682][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:29:30,215][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:29:30,721][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:29:31,226][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:29:31,730][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:29:32,236][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:29:32,746][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:29:33,249][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:29:33,756][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:29:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:29:34,757][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:29:36,169][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:29:36,713][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:29:37,201][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:29:37,706][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:29:38,214][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:29:38,718][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:29:39,229][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:29:39,734][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:29:40,242][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:29:40,744][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:29:41,247][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:29:41,753][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:29:42,261][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:29:42,763][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:29:43,268][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:29:43,774][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:29:44,283][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:29:44,797][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:29:45,307][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:29:45,823][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:29:46,329][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:29:46,831][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:29:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:29:47,834][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:29:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:29:48,840][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:29:49,343][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:29:49,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10011 tokens. [2025-11-13 08:29:50,687][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.03%, Current % of VRAM taken: 58.28%, Block Peak % of device VRAM: 62.15%, ΔTime: 00:00:34 [2025-11-13 08:29:51,418][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:29:51,420][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:29:51,421][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:29:52,267][__main__][INFO] - Iteration 643 took 1m 1s (39.56% Gen, 59.07% Train). Generation: 24s, Training: 36s. Estimated remaining time: 41h 9m 13s. Estimated total time: 51h 29m 24s. Time estimates for 10 more iterations: 10m 17s, 100 more iterations: 1h 42m 58s, 500 more iterations: 8h 34m 54s. [2025-11-13 08:29:52,271][__main__][INFO] - Starting iteration 643. [2025-11-13 08:29:52,757][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 64 and human policies 1. 
[2025-11-13 08:29:52,757][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:30:26,276][__main__][INFO] - Number of regex retries in iteration 643: 0 [2025-11-13 08:30:26,277][__main__][INFO] - agents played in iteration 643 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:30:27,103][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:27,126][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:27,151][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:27,174][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:30:27,174][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:30:27,175][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:30:27,974][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:30:28,430][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:30:28,936][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:30:29,444][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:30:29,950][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:30:30,451][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:30:30,951][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:30:31,454][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:30:31,960][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:30:32,460][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:30:32,962][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:30:33,464][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:30:33,969][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:30:34,473][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:30:34,979][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:30:35,482][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:30:35,990][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:30:36,494][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:30:37,004][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:30:37,521][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:30:38,027][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:30:38,539][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:30:39,042][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:30:39,545][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:30:40,052][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:30:40,560][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:30:41,066][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:30:41,571][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:30:42,079][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:30:42,587][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:30:43,089][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:30:43,596][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:30:44,100][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:30:44,603][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:30:45,118][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:30:45,623][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:30:46,126][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:30:46,628][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:30:47,134][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:30:47,645][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:30:48,149][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:30:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:30:49,164][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:30:49,672][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:30:50,176][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:30:50,681][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:30:51,190][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:30:51,696][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:30:52,200][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:30:52,703][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:30:53,219][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:30:53,721][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:30:54,239][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:30:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:30:55,243][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:30:55,747][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:30:56,249][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:30:56,751][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:30:57,250][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:30:57,750][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:30:58,256][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:30:58,753][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:30:59,253][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:30:59,754][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:31:00,253][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10048 tokens. [2025-11-13 08:31:01,074][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.98%, Current % of VRAM taken: 58.22%, Block Peak % of device VRAM: 62.10%, ΔTime: 00:00:33 [2025-11-13 08:31:01,829][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:31:01,831][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:31:01,832][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:31:02,764][__main__][INFO] - Iteration 644 took 1m 10s (47.88% Gen, 50.79% Train). Generation: 33s, Training: 35s. Estimated remaining time: 47h 59m 3s. Estimated total time: 58h 20m 24s. Time estimates for 10 more iterations: 11m 40s, 100 more iterations: 1h 56m 40s, 500 more iterations: 9h 43m 24s. [2025-11-13 08:31:02,766][__main__][INFO] - Starting iteration 644. [2025-11-13 08:31:03,255][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 64 and human policies 1. 
[2025-11-13 08:31:03,256][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:31:28,723][__main__][INFO] - Number of regex retries in iteration 644: 0 [2025-11-13 08:31:28,724][__main__][INFO] - agents played in iteration 644 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:31:29,539][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:29,579][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:29,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:29,638][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:31:29,639][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:31:29,640][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:31:30,432][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:31:30,894][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:31:31,401][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:31:31,902][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:31:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:31:32,915][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:31:33,419][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:31:33,920][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:31:34,421][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:31:34,923][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:31:35,423][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:31:35,922][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:31:36,417][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:31:36,918][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:31:37,419][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:31:37,920][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:31:38,420][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:31:38,932][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:31:39,434][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:31:39,953][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:31:40,460][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:31:40,961][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:31:41,473][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:31:41,979][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:31:42,485][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:31:42,990][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:31:43,493][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:31:43,995][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:31:44,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:31:45,013][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:31:45,525][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:31:46,029][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:31:46,546][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:31:47,049][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:31:47,547][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:31:48,064][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:31:48,569][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:31:49,073][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:31:49,578][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:31:50,082][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:31:50,584][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:31:51,084][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:31:51,582][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:31:52,081][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:31:52,583][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:31:53,091][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:31:53,595][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:31:54,098][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:31:54,610][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:31:55,111][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:31:55,610][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:31:56,107][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:31:56,603][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:31:57,102][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:31:57,601][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:31:58,110][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:31:58,625][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:31:59,126][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:31:59,632][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:32:00,133][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:32:00,635][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:32:01,151][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:32:01,652][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:32:02,155][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:32:02,659][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9914 tokens. [2025-11-13 08:32:03,479][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.12%, ΔTime: 00:00:33 [2025-11-13 08:32:04,239][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:32:04,241][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:32:04,242][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:32:05,158][__main__][INFO] - Iteration 645 took 1m 1s (41.14% Gen, 57.38% Train). Generation: 25s, Training: 35s. Estimated remaining time: 41h 12m 46s. Estimated total time: 51h 35m 9s. Time estimates for 10 more iterations: 10m 19s, 100 more iterations: 1h 43m 10s, 500 more iterations: 8h 35m 51s. [2025-11-13 08:32:05,160][__main__][INFO] - Starting iteration 645. [2025-11-13 08:32:05,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 64 and human policies 1. 
[2025-11-13 08:32:05,662][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:32:36,537][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:32:37,377][__main__][INFO] - Number of regex retries in iteration 645: 1 [2025-11-13 08:32:37,378][__main__][INFO] - agents played in iteration 645 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:32:38,279][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:38,321][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:38,352][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:38,380][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:32:38,380][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:32:38,381][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:32:39,243][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:32:39,709][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:32:40,219][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:32:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:32:41,248][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:32:41,757][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:32:42,262][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:32:42,766][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:32:43,278][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:32:43,784][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:32:44,289][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:32:44,797][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:32:45,303][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:32:45,808][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:32:46,319][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:32:46,830][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:32:47,360][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:32:47,866][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:32:48,376][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:32:48,885][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:32:49,395][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:32:49,903][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:32:50,411][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:32:50,917][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:32:51,438][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:32:51,942][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:32:52,457][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:32:52,961][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:32:53,470][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:32:53,981][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:32:54,486][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:32:54,991][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:32:55,497][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:32:56,006][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:32:56,511][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:32:57,009][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:32:57,512][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:32:58,028][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:32:58,527][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:32:59,036][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:32:59,540][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:33:00,044][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:33:00,568][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:33:01,076][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:33:01,580][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:33:02,087][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:33:02,595][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:33:03,101][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:33:03,605][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:33:04,111][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:33:04,617][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:33:05,121][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:33:05,620][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:33:06,122][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:33:06,624][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:33:07,145][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:33:07,649][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:33:08,152][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:33:08,654][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:33:09,156][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:33:09,672][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:33:10,178][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:33:10,679][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:33:11,177][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:33:11,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10062 tokens. [2025-11-13 08:33:12,518][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.37%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:33 [2025-11-13 08:33:13,196][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:33:13,198][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:33:13,199][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:33:14,124][__main__][INFO] - Iteration 646 took 1m 8s (46.32% Gen, 52.32% Train). Generation: 31s, Training: 35s. Estimated remaining time: 46h 39m 38s. Estimated total time: 57h 3m 11s. Time estimates for 10 more iterations: 11m 24s, 100 more iterations: 1h 54m 6s, 500 more iterations: 9h 30m 31s. [2025-11-13 08:33:14,126][__main__][INFO] - Starting iteration 646. [2025-11-13 08:33:14,613][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 64 and human policies 1. 
[2025-11-13 08:33:14,614][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:33:39,061][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:33:43,773][__main__][INFO] - Number of regex retries in iteration 646: 1 [2025-11-13 08:33:43,774][__main__][INFO] - agents played in iteration 646 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:33:44,617][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:44,641][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:44,665][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:44,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:33:44,688][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:33:44,689][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:33:45,554][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:33:46,020][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:33:46,529][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:33:47,035][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:33:47,545][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:33:48,049][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:33:48,553][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:33:49,072][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:33:49,576][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:33:50,083][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:33:50,590][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:33:51,101][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:33:51,607][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:33:52,113][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:33:52,618][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:33:53,125][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:33:53,631][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:33:54,136][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:33:54,646][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:33:55,153][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:33:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:33:56,173][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:33:56,679][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:33:57,181][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:33:57,685][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:33:58,187][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:33:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:33:59,200][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:33:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:34:00,213][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:34:00,712][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:34:01,209][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:34:01,717][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:34:02,222][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:34:02,730][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:34:03,232][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:34:03,743][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:34:04,244][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:34:04,755][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:34:05,256][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:34:05,765][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:34:06,277][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:34:06,784][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:34:07,293][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:34:07,799][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:34:08,305][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:34:08,811][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:34:09,314][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:34:09,821][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:34:10,322][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:34:10,823][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:34:11,325][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:34:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:34:12,328][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:34:12,834][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:34:13,341][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:34:13,844][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:34:14,344][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:34:14,850][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:34:15,353][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:34:15,857][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:34:16,357][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:34:16,857][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:34:17,359][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:34:17,884][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9979 tokens. [2025-11-13 08:34:18,718][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 08:34:19,482][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:34:19,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:34:19,486][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:34:20,463][__main__][INFO] - Iteration 647 took 1m 5s (44.28% Gen, 54.23% Train). Generation: 29s, Training: 35s. Estimated remaining time: 44h 27m 52s. Estimated total time: 54h 52m 31s. Time estimates for 10 more iterations: 10m 58s, 100 more iterations: 1h 49m 45s, 500 more iterations: 9h 8m 45s. [2025-11-13 08:34:20,465][__main__][INFO] - Starting iteration 647. [2025-11-13 08:34:20,971][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 64 and human policies 1. 
[2025-11-13 08:34:20,971][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:34:47,388][__main__][INFO] - Number of regex retries in iteration 647: 0 [2025-11-13 08:34:47,388][__main__][INFO] - agents played in iteration 647 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:34:48,173][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:48,198][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:48,223][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:48,245][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:34:48,245][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:34:48,246][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:34:49,056][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:34:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:34:50,029][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:34:50,534][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:34:51,038][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:34:51,535][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:34:52,036][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:34:52,543][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:34:53,045][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:34:53,551][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:34:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:34:54,564][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:34:55,069][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:34:55,577][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:34:56,101][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:34:56,610][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:34:58,427][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:34:59,019][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:34:59,530][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:35:00,032][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:35:00,538][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:35:01,045][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:35:01,558][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:35:02,062][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:35:02,570][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:35:03,077][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:35:03,573][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:35:04,079][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:35:04,581][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:35:05,082][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:35:05,584][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:35:06,083][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:35:06,598][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:35:07,104][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:35:07,615][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:35:08,122][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:35:08,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:35:09,135][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:35:09,645][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:35:10,153][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:35:10,671][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:35:11,173][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:35:11,689][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:35:12,195][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:35:12,698][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:35:13,204][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:35:13,712][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:35:14,215][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:35:14,720][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:35:15,224][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:35:15,730][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:35:16,235][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:35:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:35:17,256][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:35:17,760][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:35:18,270][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:35:18,773][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:35:19,276][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:35:19,807][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:35:20,309][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:35:20,830][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:35:21,338][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:35:21,840][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:35:22,346][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:35:22,849][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10032 tokens. [2025-11-13 08:35:23,691][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:34 [2025-11-13 08:35:24,406][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:35:24,407][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:35:24,409][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:35:25,251][__main__][INFO] - Iteration 648 took 1m 4s (41.10% Gen, 57.59% Train). Generation: 26s, Training: 37s. Estimated remaining time: 43h 8m 20s. Estimated total time: 53h 34m 3s. Time estimates for 10 more iterations: 10m 42s, 100 more iterations: 1h 47m 8s, 500 more iterations: 8h 55m 40s. [2025-11-13 08:35:25,254][__main__][INFO] - Starting iteration 648. [2025-11-13 08:35:25,732][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 64 and human policies 1. 
[2025-11-13 08:35:25,733][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:35:47,350][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:35:50,767][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:35:58,567][__main__][INFO] - Number of regex retries in iteration 648: 2 [2025-11-13 08:35:58,567][__main__][INFO] - agents played in iteration 648 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:35:59,346][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:59,373][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:59,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:59,422][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:35:59,423][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:35:59,424][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:36:00,232][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:36:00,689][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:36:01,197][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:36:01,699][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:36:02,209][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:36:02,716][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:36:03,221][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:36:03,726][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:36:04,237][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:36:04,743][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:36:05,255][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:36:05,768][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:36:06,276][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:36:06,796][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:36:07,303][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:36:07,813][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:36:08,322][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:36:08,826][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:36:09,329][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:36:09,835][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:36:10,338][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:36:10,843][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:36:11,347][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:36:11,853][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:36:12,364][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:36:12,863][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:36:13,368][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:36:13,869][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:36:14,366][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:36:14,881][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:36:15,387][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:36:15,883][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:36:16,391][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:36:16,895][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:36:17,404][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:36:17,908][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:36:18,415][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:36:18,927][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:36:19,432][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:36:19,937][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:36:20,442][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:36:20,946][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:36:21,455][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:36:21,963][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:36:22,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:36:22,988][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:36:23,493][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:36:24,005][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:36:24,505][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:36:25,014][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:36:25,516][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:36:26,012][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:36:26,511][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:36:27,015][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:36:27,512][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:36:28,028][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:36:28,527][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:36:29,029][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:36:29,534][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:36:30,035][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:36:30,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:36:31,041][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:36:31,550][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:36:32,056][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:36:32,562][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10017 tokens. [2025-11-13 08:36:33,432][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.47%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 08:36:34,190][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:36:34,193][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:36:34,195][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:36:35,208][__main__][INFO] - Iteration 649 took 1m 9s (47.26% Gen, 51.28% Train). Generation: 32s, Training: 35s. Estimated remaining time: 47h 26m 54s. Estimated total time: 57h 53m 48s. Time estimates for 10 more iterations: 11m 34s, 100 more iterations: 1h 55m 47s, 500 more iterations: 9h 38m 58s. [2025-11-13 08:36:35,210][__main__][INFO] - Starting iteration 649. [2025-11-13 08:36:35,701][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 64 and human policies 1. 
[2025-11-13 08:36:35,702][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:37:06,639][__main__][INFO] - Number of regex retries in iteration 649: 0 [2025-11-13 08:37:06,640][__main__][INFO] - agents played in iteration 649 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:37:07,488][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:07,513][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:07,538][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:07,561][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:37:07,561][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:37:07,562][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:37:08,423][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:37:08,883][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:37:09,400][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:37:09,900][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:37:10,411][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:37:10,917][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:37:11,418][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:37:11,922][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:37:12,421][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:37:12,923][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:37:13,423][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:37:13,933][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:37:14,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:37:14,953][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:37:15,458][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:37:15,964][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:37:16,465][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:37:16,976][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:37:17,482][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:37:17,980][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:37:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:37:18,979][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:37:19,487][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:37:19,981][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:37:20,482][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:37:20,991][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:37:21,497][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:37:22,002][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:37:22,510][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:37:23,010][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:37:23,521][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:37:24,021][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:37:24,528][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:37:25,023][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:37:25,526][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:37:26,037][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:37:26,536][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:37:27,037][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:37:27,547][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:37:28,051][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:37:28,548][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:37:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:37:29,548][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:37:30,055][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:37:30,556][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:37:31,059][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:37:31,562][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:37:32,063][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:37:32,563][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:37:33,064][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:37:33,565][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:37:34,067][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:37:34,565][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:37:35,062][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:37:35,562][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:37:36,062][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:37:36,565][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:37:37,074][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:37:37,575][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:37:38,076][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:37:38,577][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:37:39,077][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:37:39,579][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:37:40,081][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:37:40,586][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10020 tokens. [2025-11-13 08:37:41,465][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.10%, ΔTime: 00:00:33 [2025-11-13 08:37:42,226][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:37:42,228][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:37:42,230][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:37:43,203][__main__][INFO] - Iteration 650 took 1m 7s (45.83% Gen, 52.72% Train). Generation: 30s, Training: 35s. Estimated remaining time: 45h 47m 5s. Estimated total time: 56h 15m 6s. Time estimates for 10 more iterations: 11m 15s, 100 more iterations: 1h 52m 30s, 500 more iterations: 9h 22m 31s. [2025-11-13 08:37:43,206][__main__][INFO] - Starting iteration 650. [2025-11-13 08:37:43,692][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 64 and human policies 1. 
[2025-11-13 08:37:43,693][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:38:07,305][__main__][INFO] - Number of regex retries in iteration 650: 0 [2025-11-13 08:38:07,306][__main__][INFO] - agents played in iteration 650 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:38:08,277][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:08,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:08,366][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:08,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:38:08,396][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:38:08,397][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:38:09,203][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:38:09,665][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:38:10,177][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:38:10,690][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:38:11,196][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:38:11,705][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:38:12,212][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:38:12,719][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:38:13,224][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:38:13,729][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:38:14,233][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:38:14,752][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:38:15,256][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:38:15,761][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:38:16,264][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:38:16,767][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:38:17,276][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:38:17,782][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:38:18,290][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:38:18,801][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:38:19,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:38:19,826][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:38:20,330][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:38:20,840][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:38:21,359][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:38:21,868][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:38:22,379][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:38:22,882][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:38:23,385][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:38:23,893][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:38:24,397][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:38:24,903][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:38:25,420][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:38:25,926][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:38:26,445][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:38:26,948][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:38:27,457][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:38:27,967][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:38:28,470][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:38:28,976][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:38:29,482][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:38:29,993][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:38:30,497][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:38:31,006][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:38:31,512][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:38:32,029][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:38:32,538][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:38:33,050][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:38:33,550][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:38:34,053][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:38:34,562][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:38:35,068][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:38:35,570][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:38:36,073][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:38:36,578][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:38:37,083][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:38:37,590][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:38:38,095][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:38:38,598][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:38:39,100][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:38:39,602][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:38:40,103][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:38:40,606][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:38:41,110][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:38:41,613][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10121 tokens. [2025-11-13 08:38:42,473][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 08:38:43,203][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:38:43,205][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:38:43,207][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:38:45,328][__main__][INFO] - Iteration 651 took 1m 1s (38.31% Gen, 58.25% Train). Generation: 23s, Training: 35s. Estimated remaining time: 40h 52m 44s. Estimated total time: 51h 21m 48s. Time estimates for 10 more iterations: 10m 16s, 100 more iterations: 1h 42m 43s, 500 more iterations: 8h 33m 38s. [2025-11-13 08:38:45,330][__main__][INFO] - Starting iteration 651. [2025-11-13 08:38:45,800][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 65 and human policies 1. 
[2025-11-13 08:38:45,800][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:39:13,705][__main__][INFO] - Number of regex retries in iteration 651: 0 [2025-11-13 08:39:13,705][__main__][INFO] - agents played in iteration 651 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:39:14,535][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:14,562][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:14,588][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:14,611][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:39:14,612][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:39:14,613][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:39:15,408][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:39:15,866][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:39:16,374][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:39:16,873][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:39:17,374][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:39:17,877][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:39:18,382][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:39:18,884][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:39:19,386][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:39:19,887][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:39:20,387][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:39:20,891][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:39:21,394][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:39:21,902][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:39:22,411][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:39:22,915][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:39:23,417][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:39:23,929][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:39:24,448][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:39:24,950][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:39:25,454][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:39:25,959][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:39:26,466][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:39:26,972][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:39:27,479][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:39:27,981][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:39:28,489][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:39:28,986][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:39:29,485][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:39:29,993][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:39:30,494][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:39:30,994][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:39:31,499][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:39:32,002][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:39:32,508][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:39:33,016][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:39:33,523][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:39:34,038][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:39:34,543][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:39:35,059][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:39:35,562][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:39:36,071][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:39:36,575][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:39:37,075][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:39:37,577][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:39:38,079][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:39:38,586][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:39:39,092][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:39:39,598][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:39:40,106][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:39:40,619][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:39:41,132][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:39:41,642][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:39:42,152][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:39:42,658][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:39:43,162][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:39:43,663][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:39:44,164][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:39:44,669][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:39:45,170][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:39:45,674][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:39:46,177][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:39:46,676][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:39:47,177][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:39:47,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9984 tokens. [2025-11-13 08:39:48,532][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.99%, Current % of VRAM taken: 58.24%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:33 [2025-11-13 08:39:49,288][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:39:49,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:39:49,292][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:39:50,193][__main__][INFO] - Iteration 652 took 1m 4s (43.33% Gen, 55.26% Train). Generation: 27s, Training: 35s. Estimated remaining time: 43h 9m 33s. Estimated total time: 53h 39m 42s. Time estimates for 10 more iterations: 10m 43s, 100 more iterations: 1h 47m 19s, 500 more iterations: 8h 56m 37s. [2025-11-13 08:39:50,195][__main__][INFO] - Starting iteration 652. [2025-11-13 08:39:50,685][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 65 and human policies 1. 
[2025-11-13 08:39:50,685][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:40:12,649][__main__][INFO] - Number of regex retries in iteration 652: 0 [2025-11-13 08:40:12,651][__main__][INFO] - agents played in iteration 652 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:40:13,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:13,556][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:13,585][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:13,612][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:40:13,612][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:40:13,613][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:40:14,514][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:40:14,982][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:40:15,497][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:40:15,998][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:40:16,498][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:40:16,994][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:40:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:40:17,995][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:40:18,493][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:40:18,996][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:40:19,492][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:40:19,991][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:40:20,495][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:40:20,992][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:40:21,485][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:40:21,984][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:40:22,491][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:40:22,991][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:40:23,494][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:40:24,001][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:40:24,502][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:40:25,005][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:40:25,515][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:40:26,020][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:40:26,534][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:40:27,036][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:40:27,548][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:40:28,048][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:40:28,551][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:40:29,067][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:40:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:40:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:40:30,585][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:40:31,093][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:40:31,600][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:40:32,107][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:40:32,613][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:40:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:40:33,629][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:40:34,147][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:40:34,648][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:40:35,147][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:40:35,664][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:40:36,167][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:40:36,674][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:40:37,177][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:40:37,685][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:40:38,190][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:40:38,697][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:40:39,200][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:40:39,708][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:40:40,210][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:40:40,718][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:40:41,221][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:40:41,725][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:40:42,246][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:40:42,755][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:40:43,267][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:40:43,768][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:40:44,273][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:40:44,785][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:40:45,289][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:40:45,795][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:40:46,301][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:40:46,806][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9997 tokens. [2025-11-13 08:40:47,731][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 08:40:48,383][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:40:48,385][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:40:48,387][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:40:49,184][__main__][INFO] - Iteration 653 took 58s (37.55% Gen, 61.09% Train). Generation: 21s, Training: 35s. Estimated remaining time: 38h 13m 52s. Estimated total time: 48h 45m 0s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 30s, 500 more iterations: 8h 7m 30s. [2025-11-13 08:40:49,186][__main__][INFO] - Starting iteration 653. [2025-11-13 08:40:49,687][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 65 and human policies 1. 
[2025-11-13 08:40:49,688][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:41:20,332][__main__][INFO] - Number of regex retries in iteration 653: 0 [2025-11-13 08:41:20,333][__main__][INFO] - agents played in iteration 653 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:41:21,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:21,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:21,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:21,256][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:41:21,256][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:41:21,257][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:41:22,135][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:41:22,602][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:41:23,110][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:41:23,616][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:41:24,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:41:24,625][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:41:25,136][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:41:25,636][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:41:26,138][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:41:26,645][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:41:27,153][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:41:27,663][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:41:28,165][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:41:28,670][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:41:29,173][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:41:29,672][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:41:30,167][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:41:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:41:31,169][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:41:31,678][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:41:32,177][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:41:32,678][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:41:33,176][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:41:33,675][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:41:34,177][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:41:34,676][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:41:35,177][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:41:35,682][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:41:36,184][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:41:36,688][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:41:37,192][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:41:37,696][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:41:38,204][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:41:38,712][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:41:39,218][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:41:39,729][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:41:40,236][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:41:40,749][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:41:41,255][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:41:41,759][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:41:42,262][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:41:42,773][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:41:43,278][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:41:43,782][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:41:44,288][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:41:44,788][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:41:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:41:45,799][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:41:46,304][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:41:46,812][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:41:47,318][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:41:47,822][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:41:48,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:41:48,850][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:41:49,357][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:41:49,866][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:41:50,369][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:41:50,873][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:41:51,378][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:41:51,880][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:41:52,383][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:41:52,880][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:41:53,385][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:41:53,889][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:41:54,390][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9886 tokens. [2025-11-13 08:41:55,216][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.01%, Current % of VRAM taken: 58.26%, Block Peak % of device VRAM: 62.09%, ΔTime: 00:00:33 [2025-11-13 08:41:55,949][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:41:55,951][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:41:55,953][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:41:56,855][__main__][INFO] - Iteration 654 took 1m 7s (45.62% Gen, 53.03% Train). Generation: 30s, Training: 35s. Estimated remaining time: 45h 26m 12s. Estimated total time: 55h 58m 27s. Time estimates for 10 more iterations: 11m 11s, 100 more iterations: 1h 51m 56s, 500 more iterations: 9h 19m 44s. [2025-11-13 08:41:56,857][__main__][INFO] - Starting iteration 654. [2025-11-13 08:41:57,346][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 65 and human policies 1. 
[2025-11-13 08:41:57,346][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:42:08,975][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:42:09,627][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:42:19,926][__main__][INFO] - Number of regex retries in iteration 654: 2 [2025-11-13 08:42:19,927][__main__][INFO] - agents played in iteration 654 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:42:20,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:20,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:20,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:20,818][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:42:20,819][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:42:20,819][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:42:21,700][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:42:22,157][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:42:22,673][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:42:23,170][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:42:23,666][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:42:24,182][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:42:24,690][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:42:25,195][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:42:25,704][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:42:26,210][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:42:26,715][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:42:27,222][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:42:27,724][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:42:28,228][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:42:28,728][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:42:29,238][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:42:29,738][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:42:30,237][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:42:30,740][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:42:31,241][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:42:31,744][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:42:32,258][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:42:32,756][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:42:33,278][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:42:33,776][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:42:34,303][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:42:34,805][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:42:35,314][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:42:35,816][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:42:36,322][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:42:36,828][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:42:37,331][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:42:37,833][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:42:38,337][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:42:38,846][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:42:39,351][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:42:39,858][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:42:40,363][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:42:40,866][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:42:41,374][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:42:41,877][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:42:42,395][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:42:42,898][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:42:43,452][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:42:44,979][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:42:45,477][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:42:45,976][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:42:46,487][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:42:47,001][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:42:47,504][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:42:48,011][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:42:48,521][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:42:49,026][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:42:49,529][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:42:50,036][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:42:50,540][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:42:51,044][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:42:51,541][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:42:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:42:52,549][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:42:53,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:42:53,563][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:42:54,066][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:42:54,576][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:42:55,080][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9928 tokens. [2025-11-13 08:42:56,097][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:34 [2025-11-13 08:42:56,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:42:56,817][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:42:56,819][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:42:57,657][__main__][INFO] - Iteration 655 took 1m 0s (37.44% Gen, 61.17% Train). Generation: 22s, Training: 36s. Estimated remaining time: 39h 42m 17s. Estimated total time: 50h 15m 33s. Time estimates for 10 more iterations: 10m 3s, 100 more iterations: 1h 40m 31s, 500 more iterations: 8h 22m 35s. [2025-11-13 08:42:57,660][__main__][INFO] - Starting iteration 655. [2025-11-13 08:42:58,164][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 65 and human policies 1. 
[2025-11-13 08:42:58,165][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:43:35,349][__main__][INFO] - Number of regex retries in iteration 655: 0 [2025-11-13 08:43:35,350][__main__][INFO] - agents played in iteration 655 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:43:36,159][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:36,187][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:36,214][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:36,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:43:36,238][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:43:36,239][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:43:37,028][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:43:37,486][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:43:37,991][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:43:38,493][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:43:38,995][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:43:39,500][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:43:40,011][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:43:40,516][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:43:41,016][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:43:41,519][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:43:42,020][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:43:42,521][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:43:43,021][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:43:43,523][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:43:44,024][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:43:44,527][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:43:45,028][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:43:45,532][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:43:46,038][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:43:46,540][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:43:47,043][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:43:47,550][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:43:48,055][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:43:48,559][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:43:49,065][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:43:49,568][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:43:50,075][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:43:50,595][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:43:51,100][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:43:51,610][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:43:52,118][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:43:52,624][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:43:53,140][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:43:53,644][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:43:54,150][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:43:54,655][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:43:55,160][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:43:55,665][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:43:56,174][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:43:56,684][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:43:57,201][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:43:57,701][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:43:58,203][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:43:58,701][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:43:59,207][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:43:59,718][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:44:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:44:00,725][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:44:01,228][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:44:01,731][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:44:02,234][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:44:02,741][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:44:03,245][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:44:03,743][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:44:04,248][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:44:04,750][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:44:05,264][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:44:05,772][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:44:06,290][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:44:06,792][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:44:07,297][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:44:07,803][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:44:08,303][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:44:08,811][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:44:09,312][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10011 tokens. [2025-11-13 08:44:10,141][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.01%, Current % of VRAM taken: 58.25%, Block Peak % of device VRAM: 62.04%, ΔTime: 00:00:33 [2025-11-13 08:44:10,887][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:44:10,889][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:44:10,890][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:44:11,811][__main__][INFO] - Iteration 656 took 1m 13s (50.49% Gen, 48.26% Train). Generation: 37s, Training: 35s. Estimated remaining time: 50h 47m 52s. Estimated total time: 61h 22m 22s. Time estimates for 10 more iterations: 12m 16s, 100 more iterations: 2h 2m 44s, 500 more iterations: 10h 13m 43s. [2025-11-13 08:44:11,813][__main__][INFO] - Starting iteration 656. [2025-11-13 08:44:12,302][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 65 and human policies 1. 
[2025-11-13 08:44:12,303][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:44:43,593][__main__][INFO] - Number of regex retries in iteration 656: 0 [2025-11-13 08:44:43,594][__main__][INFO] - agents played in iteration 656 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:44:44,381][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:44,406][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:44,431][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:44,454][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:44:44,454][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:44:44,455][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:44:45,261][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:44:45,725][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:44:46,236][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:44:46,736][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:44:47,236][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:44:47,739][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:44:48,242][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:44:48,748][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:44:49,251][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:44:49,749][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:44:50,247][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:44:50,752][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:44:51,253][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:44:51,748][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:44:52,247][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:44:52,752][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:44:53,253][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:44:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:44:54,264][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:44:54,762][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:44:55,258][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:44:55,761][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:44:56,261][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:44:56,760][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:44:57,266][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:44:57,772][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:44:58,277][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:44:58,789][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:44:59,300][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:44:59,804][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:45:00,311][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:45:00,817][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:45:01,329][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:45:01,835][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:45:02,339][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:45:02,841][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:45:03,338][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:45:03,844][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:45:04,344][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:45:04,850][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:45:05,355][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:45:05,862][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:45:06,367][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:45:06,873][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:45:07,395][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:45:07,918][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:45:08,422][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:45:08,925][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:45:09,430][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:45:09,934][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:45:10,440][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:45:10,942][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:45:11,447][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:45:11,955][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:45:12,458][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:45:12,964][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:45:13,471][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:45:13,974][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:45:14,477][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:45:14,973][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:45:15,470][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:45:15,983][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:45:16,484][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:45:16,986][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:45:17,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9937 tokens. [2025-11-13 08:45:18,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.98%, Current % of VRAM taken: 58.22%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 08:45:19,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:45:19,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:45:19,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:45:19,996][__main__][INFO] - Iteration 657 took 1m 7s (46.22% Gen, 52.40% Train). Generation: 31s, Training: 35s. Estimated remaining time: 45h 49m 5s. Estimated total time: 56h 24m 43s. Time estimates for 10 more iterations: 11m 16s, 100 more iterations: 1h 52m 49s, 500 more iterations: 9h 24m 7s. [2025-11-13 08:45:19,998][__main__][INFO] - Starting iteration 657. [2025-11-13 08:45:20,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 65 and human policies 1. 
[2025-11-13 08:45:20,487][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:45:30,270][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:45:41,091][__main__][INFO] - Number of regex retries in iteration 657: 1 [2025-11-13 08:45:41,091][__main__][INFO] - agents played in iteration 657 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:45:42,076][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:42,110][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:42,136][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:42,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:45:42,161][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:45:42,161][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:45:43,039][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:45:43,511][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:45:44,022][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:45:44,535][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:45:45,035][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:45:45,532][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:45:46,031][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:45:46,534][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:45:47,032][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:45:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:45:48,040][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:45:48,542][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:45:49,038][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:45:49,539][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:45:50,044][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:45:50,541][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:45:51,037][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:45:51,540][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:45:52,043][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:45:52,545][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:45:53,046][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:45:53,548][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:45:54,048][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:45:54,553][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:45:55,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:45:55,562][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:45:56,067][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:45:56,571][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:45:57,072][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:45:57,578][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:45:58,097][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:45:58,607][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:45:59,114][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:45:59,619][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:46:00,122][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:46:00,631][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:46:01,140][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:46:01,643][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:46:02,146][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:46:02,655][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:46:03,161][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:46:03,667][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:46:04,173][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:46:04,678][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:46:05,181][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:46:05,684][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:46:06,203][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:46:06,705][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:46:07,228][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:46:07,732][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:46:08,236][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:46:08,738][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:46:09,237][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:46:09,738][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:46:10,241][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:46:10,747][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:46:11,259][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:46:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:46:12,262][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:46:12,765][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:46:13,273][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:46:13,781][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:46:14,287][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:46:14,802][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:46:15,313][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9919 tokens. [2025-11-13 08:46:16,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.30%, Current % of VRAM taken: 58.54%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 08:46:16,916][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:46:16,917][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:46:16,919][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:46:17,908][__main__][INFO] - Iteration 658 took 57s (35.88% Gen, 62.39% Train). Generation: 20s, Training: 35s. Estimated remaining time: 37h 14m 30s. Estimated total time: 47h 51m 6s. Time estimates for 10 more iterations: 9m 34s, 100 more iterations: 1h 35m 42s, 500 more iterations: 7h 58m 31s. [2025-11-13 08:46:17,910][__main__][INFO] - Starting iteration 658. [2025-11-13 08:46:18,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 65 and human policies 1. 
[2025-11-13 08:46:18,398][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:46:50,706][__main__][INFO] - Number of regex retries in iteration 658: 0 [2025-11-13 08:46:50,707][__main__][INFO] - agents played in iteration 658 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:46:51,549][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:51,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:51,602][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:51,625][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:46:51,626][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:46:51,626][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:46:52,427][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:46:52,889][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:46:53,395][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:46:53,903][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:46:54,410][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:46:54,920][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:46:55,428][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:46:55,937][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:46:56,445][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:46:56,945][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:46:57,451][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:46:57,952][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:46:58,449][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:46:58,950][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:46:59,455][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:46:59,952][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:47:00,456][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:47:00,958][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:47:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:47:01,967][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:47:02,469][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:47:02,972][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:47:03,479][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:47:03,987][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:47:04,495][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:47:05,001][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:47:05,507][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:47:06,015][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:47:06,537][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:47:07,043][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:47:07,549][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:47:08,056][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:47:08,561][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:47:09,067][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:47:09,576][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:47:10,080][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:47:10,588][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:47:11,096][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:47:11,603][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:47:12,117][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:47:12,621][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:47:13,133][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:47:13,636][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:47:14,139][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:47:14,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:47:15,152][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:47:15,657][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:47:16,162][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:47:16,665][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:47:17,173][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:47:17,679][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:47:18,182][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:47:18,687][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:47:19,188][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:47:19,687][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:47:20,187][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:47:20,686][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:47:21,186][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:47:21,687][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:47:22,188][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:47:22,688][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:47:23,187][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:47:23,691][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:47:24,189][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:47:24,684][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9956 tokens. [2025-11-13 08:47:25,513][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.97%, Current % of VRAM taken: 58.21%, Block Peak % of device VRAM: 62.49%, ΔTime: 00:00:33 [2025-11-13 08:47:26,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:47:26,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:47:26,246][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:47:27,128][__main__][INFO] - Iteration 659 took 1m 8s (47.01% Gen, 51.71% Train). Generation: 32s, Training: 35s. Estimated remaining time: 46h 38m 47s. Estimated total time: 57h 16m 32s. Time estimates for 10 more iterations: 11m 27s, 100 more iterations: 1h 54m 33s, 500 more iterations: 9h 32m 45s. [2025-11-13 08:47:27,130][__main__][INFO] - Starting iteration 659. [2025-11-13 08:47:27,599][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 65 and human policies 1. 
[2025-11-13 08:47:27,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:47:52,377][__main__][INFO] - Number of regex retries in iteration 659: 0 [2025-11-13 08:47:52,378][__main__][INFO] - agents played in iteration 659 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:47:53,213][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:53,236][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:53,258][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:53,281][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:47:53,282][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:47:53,282][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:47:54,096][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:47:54,548][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:47:55,045][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:47:55,558][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:47:56,060][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:47:56,556][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:47:57,050][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:47:57,557][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:47:58,058][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:47:58,562][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:47:59,087][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:47:59,591][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:48:00,098][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:48:00,607][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:48:01,112][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:48:01,615][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:48:02,119][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:48:02,621][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:48:03,130][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:48:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:48:04,143][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:48:04,659][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:48:05,160][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:48:05,667][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:48:06,222][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:48:07,094][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:48:07,811][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:48:08,318][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:48:08,832][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:48:09,343][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:48:09,851][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:48:10,358][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:48:10,865][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:48:11,376][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:48:11,875][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:48:12,382][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:48:12,890][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:48:13,401][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:48:13,919][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:48:14,424][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:48:14,931][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:48:15,434][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:48:15,940][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:48:16,454][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:48:16,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:48:17,460][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:48:17,968][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:48:18,479][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:48:18,988][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:48:19,493][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:48:19,995][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:48:20,503][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:48:21,003][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:48:21,509][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:48:22,018][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:48:22,522][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:48:23,022][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:48:23,532][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:48:24,038][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:48:24,554][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:48:25,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:48:25,559][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:48:26,062][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:48:26,557][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:48:27,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10012 tokens. [2025-11-13 08:48:27,907][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 08:48:28,641][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:48:28,642][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:48:28,644][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:48:29,520][__main__][INFO] - Iteration 660 took 1m 1s (40.02% Gen, 58.57% Train). Generation: 24s, Training: 36s. Estimated remaining time: 40h 57m 15s. Estimated total time: 51h 36m 3s. Time estimates for 10 more iterations: 10m 19s, 100 more iterations: 1h 43m 12s, 500 more iterations: 8h 36m 0s. [2025-11-13 08:48:29,522][__main__][INFO] - Starting iteration 660. [2025-11-13 08:48:30,029][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 65 and human policies 1. 
[2025-11-13 08:48:30,029][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:49:02,062][__main__][INFO] - Number of regex retries in iteration 660: 0 [2025-11-13 08:49:02,063][__main__][INFO] - agents played in iteration 660 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:49:02,906][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:02,929][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:02,954][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:02,977][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:49:02,977][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:49:02,978][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:49:03,835][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:49:04,292][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:49:04,795][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:49:05,303][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:49:05,804][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:49:06,307][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:49:06,814][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:49:07,324][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:49:07,844][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:49:08,349][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:49:08,850][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:49:09,354][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:49:09,857][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:49:10,367][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:49:10,901][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:49:11,409][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:49:11,913][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:49:12,419][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:49:12,941][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:49:13,450][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:49:13,949][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:49:14,451][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:49:14,959][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:49:15,468][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:49:15,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:49:16,477][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:49:16,984][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:49:17,491][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:49:17,998][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:49:18,503][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:49:19,009][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:49:19,517][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:49:20,026][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:49:20,538][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:49:21,053][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:49:21,555][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:49:22,072][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:49:22,581][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:49:23,087][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:49:23,593][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:49:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:49:24,598][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:49:25,101][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:49:25,609][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:49:26,116][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:49:26,617][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:49:27,122][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:49:27,627][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:49:28,131][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:49:28,633][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:49:29,138][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:49:29,641][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:49:30,151][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:49:30,658][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:49:31,159][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:49:31,658][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:49:32,159][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:49:32,672][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:49:33,173][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:49:33,675][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:49:34,174][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:49:34,678][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:49:35,180][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:49:35,680][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:49:36,185][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10016 tokens. [2025-11-13 08:49:37,033][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 08:49:37,770][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:49:37,772][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:49:37,774][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:49:39,554][__main__][INFO] - Iteration 661 took 1m 9s (46.07% Gen, 51.36% Train). Generation: 32s, Training: 35s. Estimated remaining time: 47h 16m 20s. Estimated total time: 57h 56m 18s. Time estimates for 10 more iterations: 11m 35s, 100 more iterations: 1h 55m 52s, 500 more iterations: 9h 39m 23s. [2025-11-13 08:49:39,557][__main__][INFO] - Starting iteration 661. [2025-11-13 08:49:40,063][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 66 and human policies 1. 
[2025-11-13 08:49:40,064][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:50:07,825][__main__][INFO] - Number of regex retries in iteration 661: 0 [2025-11-13 08:50:07,826][__main__][INFO] - agents played in iteration 661 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:50:08,609][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:08,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:08,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:08,680][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:50:08,680][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:50:08,682][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:50:09,476][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:50:09,929][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:50:10,438][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:50:10,937][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:50:11,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:50:11,938][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:50:12,433][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:50:12,934][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:50:13,436][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:50:13,939][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:50:14,436][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:50:14,964][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:50:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:50:15,965][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:50:16,469][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:50:16,976][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:50:17,484][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:50:17,988][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:50:18,493][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:50:18,997][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:50:19,500][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:50:20,006][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:50:20,511][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:50:21,018][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:50:21,521][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:50:22,024][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:50:22,530][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:50:23,046][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:50:23,555][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:50:24,072][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:50:24,578][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:50:25,081][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:50:25,587][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:50:26,090][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:50:26,599][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:50:27,105][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:50:27,608][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:50:28,116][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:50:28,620][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:50:29,122][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:50:29,624][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:50:30,139][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:50:30,654][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:50:31,159][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:50:31,663][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:50:32,160][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:50:32,662][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:50:33,177][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:50:33,677][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:50:34,173][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:50:34,685][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:50:35,185][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:50:35,685][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:50:36,186][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:50:36,685][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:50:37,197][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:50:37,701][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:50:38,203][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:50:38,705][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:50:39,207][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:50:39,709][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:50:40,206][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:50:40,708][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:50:41,213][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:50:41,739][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9824 tokens. [2025-11-13 08:50:42,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:33 [2025-11-13 08:50:43,321][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:50:43,323][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:50:43,325][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:50:44,227][__main__][INFO] - Iteration 662 took 1m 4s (43.27% Gen, 55.32% Train). Generation: 27s, Training: 35s. Estimated remaining time: 42h 47m 13s. Estimated total time: 53h 28m 16s. Time estimates for 10 more iterations: 10m 41s, 100 more iterations: 1h 46m 56s, 500 more iterations: 8h 54m 42s. [2025-11-13 08:50:44,230][__main__][INFO] - Starting iteration 662. [2025-11-13 08:50:44,716][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 66 and human policies 1. 
[2025-11-13 08:50:44,717][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:51:15,214][__main__][INFO] - Number of regex retries in iteration 662: 0 [2025-11-13 08:51:15,215][__main__][INFO] - agents played in iteration 662 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:51:16,078][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:16,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:16,158][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:16,186][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:51:16,187][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:51:16,188][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:51:16,994][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:51:17,451][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:51:17,957][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:51:18,457][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:51:18,959][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:51:19,461][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:51:19,961][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:51:20,475][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:51:20,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:51:21,485][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:51:21,997][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:51:22,502][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:51:23,004][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:51:23,510][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:51:24,015][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:51:24,521][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:51:25,026][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:51:25,529][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:51:26,031][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:51:26,536][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:51:27,041][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:51:27,550][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:51:28,057][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:51:28,562][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:51:29,061][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:51:29,581][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:51:30,078][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:51:30,587][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:51:31,091][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:51:31,597][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:51:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:51:32,620][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:51:33,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:51:33,627][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:51:34,133][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:51:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:51:35,145][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:51:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:51:36,156][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:51:36,661][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:51:37,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:51:37,672][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:51:38,179][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:51:38,695][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:51:39,199][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:51:39,704][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:51:40,210][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:51:40,723][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:51:41,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:51:41,735][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:51:42,240][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:51:42,743][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:51:43,249][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:51:43,755][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:51:44,261][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:51:44,765][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:51:45,267][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:51:45,768][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:51:46,274][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:51:46,796][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:51:47,296][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:51:47,809][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:51:48,312][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:51:48,819][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:51:49,329][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9891 tokens. [2025-11-13 08:51:50,154][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:33 [2025-11-13 08:51:50,819][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:51:50,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:51:50,823][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:51:51,712][__main__][INFO] - Iteration 663 took 1m 6s (45.52% Gen, 53.15% Train). Generation: 30s, Training: 35s. Estimated remaining time: 45h 7m 40s. Estimated total time: 55h 49m 50s. Time estimates for 10 more iterations: 11m 9s, 100 more iterations: 1h 51m 39s, 500 more iterations: 9h 18m 18s. [2025-11-13 08:51:51,714][__main__][INFO] - Starting iteration 663. [2025-11-13 08:51:52,203][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 66 and human policies 1. 
[2025-11-13 08:51:52,203][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:52:24,255][__main__][INFO] - Number of regex retries in iteration 663: 0 [2025-11-13 08:52:24,256][__main__][INFO] - agents played in iteration 663 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:52:25,120][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:25,145][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:25,170][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:25,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:52:25,192][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:52:25,193][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:52:25,991][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:52:26,455][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:52:26,956][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:52:27,458][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:52:27,954][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:52:28,456][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:52:28,962][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:52:29,456][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:52:29,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:52:30,460][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:52:30,962][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:52:31,464][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:52:31,973][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:52:32,479][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:52:32,996][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:52:33,501][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:52:34,016][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:52:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:52:35,028][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:52:35,533][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:52:36,039][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:52:36,542][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:52:37,048][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:52:37,557][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:52:38,064][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:52:38,564][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:52:39,067][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:52:39,576][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:52:40,082][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:52:40,594][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:52:41,102][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:52:41,605][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:52:42,112][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:52:42,617][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:52:43,119][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:52:43,618][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:52:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:52:44,622][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:52:45,124][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:52:45,630][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:52:46,134][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:52:46,639][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:52:47,147][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:52:47,652][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:52:48,159][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:52:48,663][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:52:49,167][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:52:49,676][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:52:50,182][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:52:50,687][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:52:51,187][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:52:51,691][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:52:52,198][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:52:52,715][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:52:53,225][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:52:53,737][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:52:54,267][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:52:54,772][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:52:55,276][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:52:55,780][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:52:56,290][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:52:56,795][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:52:57,302][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:52:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:52:58,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10013 tokens. [2025-11-13 08:52:59,117][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 08:52:59,845][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:52:59,847][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:52:59,849][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:53:00,740][__main__][INFO] - Iteration 664 took 1m 8s (46.76% Gen, 51.93% Train). Generation: 32s, Training: 35s. Estimated remaining time: 46h 23m 33s. Estimated total time: 57h 6m 52s. Time estimates for 10 more iterations: 11m 25s, 100 more iterations: 1h 54m 13s, 500 more iterations: 9h 31m 8s. [2025-11-13 08:53:00,742][__main__][INFO] - Starting iteration 664. [2025-11-13 08:53:01,214][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 66 and human policies 1. 
[2025-11-13 08:53:01,215][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:53:24,467][__main__][INFO] - Number of regex retries in iteration 664: 0 [2025-11-13 08:53:24,468][__main__][INFO] - agents played in iteration 664 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:53:25,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:25,302][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:25,328][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:25,364][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:53:25,365][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:53:25,366][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:53:26,177][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:53:26,640][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:53:27,143][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:53:27,648][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:53:28,153][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:53:28,651][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:53:29,156][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:53:29,664][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:53:30,166][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:53:30,669][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:53:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:53:31,676][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:53:32,180][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:53:32,684][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:53:33,190][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:53:33,685][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:53:34,192][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:53:34,709][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:53:35,212][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:53:35,712][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:53:36,212][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:53:36,713][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:53:37,211][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:53:37,705][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:53:38,206][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:53:38,707][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:53:39,204][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:53:39,706][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:53:40,205][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:53:40,699][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:53:41,196][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:53:41,702][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:53:42,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:53:42,723][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:53:43,228][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:53:43,734][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:53:44,238][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:53:44,742][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:53:45,247][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:53:45,753][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:53:46,258][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:53:46,766][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:53:47,270][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:53:47,777][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:53:48,283][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:53:48,790][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:53:49,297][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:53:49,799][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:53:50,306][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:53:50,820][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:53:51,319][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:53:51,832][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:53:52,336][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:53:52,838][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:53:53,340][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:53:53,842][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:53:54,337][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:53:54,838][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:53:55,341][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:53:55,842][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:53:56,344][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:53:56,849][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:53:57,347][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:53:57,847][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:53:58,349][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9969 tokens. [2025-11-13 08:53:59,175][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:33 [2025-11-13 08:53:59,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:53:59,895][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:53:59,896][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:54:00,792][__main__][INFO] - Iteration 665 took 59s (39.03% Gen, 59.47% Train). Generation: 23s, Training: 35s. Estimated remaining time: 38h 54m 36s. Estimated total time: 49h 38m 55s. Time estimates for 10 more iterations: 9m 55s, 100 more iterations: 1h 39m 17s, 500 more iterations: 8h 16m 29s. [2025-11-13 08:54:00,794][__main__][INFO] - Starting iteration 665. [2025-11-13 08:54:01,261][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 66 and human policies 1. 
[2025-11-13 08:54:01,262][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:54:23,765][__main__][INFO] - Number of regex retries in iteration 665: 0 [2025-11-13 08:54:23,765][__main__][INFO] - agents played in iteration 665 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:54:24,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:24,720][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:24,757][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:24,788][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:54:24,789][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:54:24,790][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:54:25,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:54:26,076][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:54:26,580][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:54:27,078][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:54:27,581][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:54:28,081][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:54:28,581][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:54:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:54:29,584][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:54:30,084][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:54:30,587][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:54:31,089][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:54:31,596][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:54:32,098][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:54:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:54:33,114][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:54:33,615][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:54:34,120][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:54:34,629][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:54:35,131][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:54:35,639][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:54:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:54:36,657][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:54:37,156][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:54:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:54:38,164][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:54:38,668][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:54:39,173][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:54:39,679][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:54:40,184][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:54:40,691][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:54:41,196][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:54:41,702][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:54:42,202][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:54:42,706][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:54:43,207][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:54:43,711][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:54:44,212][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:54:44,723][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:54:45,229][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:54:45,745][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:54:46,254][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:54:46,761][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:54:47,268][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:54:47,776][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:54:48,282][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:54:48,791][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:54:49,299][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:54:49,807][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:54:50,308][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:54:50,815][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:54:51,329][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:54:51,834][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:54:52,338][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:54:52,842][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:54:53,348][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:54:53,853][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:54:54,357][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:54:54,858][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:54:55,367][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:54:55,871][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:54:56,376][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:54:56,879][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:54:57,383][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:54:57,896][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9938 tokens. [2025-11-13 08:54:58,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.20%, ΔTime: 00:00:33 [2025-11-13 08:54:59,564][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:54:59,565][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:54:59,567][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:55:00,505][__main__][INFO] - Iteration 666 took 59s (37.98% Gen, 60.43% Train). Generation: 22s, Training: 35s. Estimated remaining time: 38h 36m 55s. Estimated total time: 49h 22m 13s. Time estimates for 10 more iterations: 9m 52s, 100 more iterations: 1h 38m 44s, 500 more iterations: 8h 13m 42s. [2025-11-13 08:55:00,507][__main__][INFO] - Starting iteration 666. [2025-11-13 08:55:00,997][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 66 and human policies 1. 
[2025-11-13 08:55:00,998][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:55:29,442][__main__][INFO] - Number of regex retries in iteration 666: 0 [2025-11-13 08:55:29,443][__main__][INFO] - agents played in iteration 666 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:55:30,274][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:30,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:30,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:30,348][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:55:30,349][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:55:30,350][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:55:31,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:55:31,614][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:55:32,126][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:55:32,623][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:55:33,133][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:55:33,628][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:55:34,132][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:55:34,646][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:55:35,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:55:35,652][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:55:36,157][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:55:36,653][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:55:37,161][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:55:37,665][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:55:38,170][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:55:38,675][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:55:39,172][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:55:39,679][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:55:40,185][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:55:40,686][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:55:41,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:55:41,701][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:55:42,207][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:55:42,712][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:55:43,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:55:43,717][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:55:44,221][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:55:44,725][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:55:45,228][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:55:45,733][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:55:46,254][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:55:46,760][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:55:47,266][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:55:47,774][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:55:48,273][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:55:48,775][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:55:49,279][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:55:49,785][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:55:50,291][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:55:50,793][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:55:51,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:55:51,802][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:55:52,302][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:55:52,813][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:55:53,313][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:55:53,813][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:55:54,316][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:55:54,825][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:55:55,328][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:55:55,832][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:55:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:55:56,847][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:55:57,350][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:55:57,860][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:55:58,366][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:55:58,872][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:55:59,376][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:55:59,883][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:56:00,397][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:56:00,900][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:56:01,408][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:56:01,917][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:56:02,419][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:56:02,920][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:56:03,438][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10010 tokens. [2025-11-13 08:56:04,271][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 08:56:05,012][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:56:05,016][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:56:05,017][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:56:05,929][__main__][INFO] - Iteration 667 took 1m 4s (43.81% Gen, 54.79% Train). Generation: 28s, Training: 35s. Estimated remaining time: 43h 20m 13s. Estimated total time: 54h 6m 37s. Time estimates for 10 more iterations: 10m 49s, 100 more iterations: 1h 48m 13s, 500 more iterations: 9h 1m 6s. [2025-11-13 08:56:05,930][__main__][INFO] - Starting iteration 667. [2025-11-13 08:56:06,397][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 66 and human policies 1. 
[2025-11-13 08:56:06,398][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:56:19,541][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:56:19,621][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:56:30,835][__main__][INFO] - Number of regex retries in iteration 667: 2 [2025-11-13 08:56:30,836][__main__][INFO] - agents played in iteration 667 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:56:31,669][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:31,691][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:31,713][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:31,735][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:56:31,736][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:56:31,737][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:56:32,531][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:56:32,988][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:56:33,496][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:56:33,998][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:56:34,504][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:56:35,006][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:56:35,507][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:56:36,010][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:56:36,513][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:56:37,023][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:56:37,552][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:56:38,056][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:56:38,561][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:56:39,068][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:56:39,573][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:56:40,083][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:56:40,586][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:56:41,091][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:56:41,597][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:56:42,100][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:56:42,603][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:56:43,103][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:56:43,604][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:56:44,108][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:56:44,608][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:56:45,114][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:56:45,627][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:56:46,123][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:56:46,636][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:56:47,143][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:56:47,648][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:56:48,162][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:56:48,667][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:56:49,172][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:56:49,670][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:56:50,176][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:56:50,682][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:56:51,186][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:56:51,689][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:56:52,197][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:56:52,701][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:56:53,204][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:56:53,707][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:56:54,210][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:56:54,719][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:56:55,221][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:56:55,725][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:56:56,239][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:56:56,742][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:56:58,176][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:56:58,688][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:56:59,197][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:56:59,706][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:57:00,211][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:57:00,712][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:57:01,216][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:57:01,718][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:57:02,224][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:57:02,742][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:57:03,246][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:57:03,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:57:04,266][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:57:04,772][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:57:05,281][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:57:05,782][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10014 tokens. [2025-11-13 08:57:06,650][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.42%, ΔTime: 00:00:34 [2025-11-13 08:57:07,369][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:57:07,370][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:57:07,372][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:57:08,202][__main__][INFO] - Iteration 668 took 1m 1s (39.54% Gen, 59.12% Train). Generation: 24s, Training: 36s. Estimated remaining time: 40h 42m 49s. Estimated total time: 51h 30m 16s. Time estimates for 10 more iterations: 10m 18s, 100 more iterations: 1h 43m 0s, 500 more iterations: 8h 35m 2s. [2025-11-13 08:57:08,205][__main__][INFO] - Starting iteration 668. [2025-11-13 08:57:08,694][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 66 and human policies 1. 
[2025-11-13 08:57:08,694][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:57:36,926][__main__][INFO] - Number of regex retries in iteration 668: 0 [2025-11-13 08:57:36,927][__main__][INFO] - agents played in iteration 668 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:57:37,860][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:37,882][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:37,904][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:37,926][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:57:37,927][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:57:37,928][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:57:38,722][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:57:39,181][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:57:39,689][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:57:40,194][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:57:40,699][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:57:41,202][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:57:41,705][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:57:42,208][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:57:42,712][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:57:43,217][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:57:43,721][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:57:44,227][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:57:44,734][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:57:45,234][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:57:45,729][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:57:46,231][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:57:46,733][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:57:47,233][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:57:47,725][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:57:48,218][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:57:48,715][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:57:49,211][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:57:49,717][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:57:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:57:50,720][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:57:51,230][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:57:51,757][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:57:52,267][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:57:52,763][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:57:53,263][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:57:53,776][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:57:54,283][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:57:54,789][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:57:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:57:55,800][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:57:56,307][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:57:56,812][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:57:57,321][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:57:57,827][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:57:58,331][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:57:58,840][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:57:59,346][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:57:59,844][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:58:00,339][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:58:00,842][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:58:01,338][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:58:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:58:02,359][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:58:02,880][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:58:03,383][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:58:03,887][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:58:04,387][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:58:04,894][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:58:05,391][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:58:05,895][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:58:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:58:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:58:07,400][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:58:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:58:08,394][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:58:08,895][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:58:09,396][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:58:09,898][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:58:10,398][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:58:10,898][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9823 tokens. [2025-11-13 08:58:11,720][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.15%, ΔTime: 00:00:33 [2025-11-13 08:58:12,488][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:58:12,490][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:58:12,492][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:58:13,432][__main__][INFO] - Iteration 669 took 1m 4s (43.61% Gen, 54.94% Train). Generation: 28s, Training: 35s. Estimated remaining time: 43h 8m 25s. Estimated total time: 53h 56m 57s. Time estimates for 10 more iterations: 10m 47s, 100 more iterations: 1h 47m 53s, 500 more iterations: 8h 59m 29s. [2025-11-13 08:58:13,434][__main__][INFO] - Starting iteration 669. [2025-11-13 08:58:13,904][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 66 and human policies 1. 
[2025-11-13 08:58:13,905][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:58:34,555][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:58:47,848][__main__][INFO] - Number of regex retries in iteration 669: 1 [2025-11-13 08:58:47,849][__main__][INFO] - agents played in iteration 669 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:58:48,688][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:48,715][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:48,742][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:48,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:58:48,766][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:58:48,767][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:58:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:58:50,006][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 08:58:50,513][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 08:58:51,017][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 08:58:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 08:58:52,020][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 08:58:52,518][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 08:58:53,024][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 08:58:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 08:58:54,031][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 08:58:54,538][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 08:58:55,045][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 08:58:55,578][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 08:58:56,094][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 08:58:56,603][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 08:58:57,113][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 08:58:57,619][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 08:58:58,124][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 08:58:58,631][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 08:58:59,135][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 08:58:59,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
08:59:00,151][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 08:59:00,657][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 08:59:01,165][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 08:59:01,670][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 08:59:02,195][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 08:59:02,703][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 08:59:03,209][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 08:59:03,718][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 08:59:04,225][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 08:59:04,725][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 08:59:05,231][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 08:59:06,350][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 08:59:06,910][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 08:59:07,424][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 08:59:07,928][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 08:59:08,431][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 08:59:08,937][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 08:59:09,441][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 08:59:09,943][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 08:59:10,442][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 08:59:10,943][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
08:59:11,448][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 08:59:11,947][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 08:59:12,446][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 08:59:12,950][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 08:59:13,452][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 08:59:13,952][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 08:59:14,455][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 08:59:14,958][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 08:59:15,462][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 08:59:15,961][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 08:59:16,460][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 08:59:16,960][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 08:59:17,460][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 08:59:17,958][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 08:59:18,457][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 08:59:18,960][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 08:59:19,470][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 08:59:19,976][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 08:59:20,484][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 08:59:20,990][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 08:59:21,494][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
08:59:22,015][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 08:59:22,518][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10060 tokens. [2025-11-13 08:59:23,480][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.32%, Block Peak % of device VRAM: 62.11%, ΔTime: 00:00:33 [2025-11-13 08:59:24,123][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 08:59:24,125][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 08:59:24,127][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 08:59:24,945][__main__][INFO] - Iteration 670 took 1m 11s (47.78% Gen, 51.07% Train). Generation: 33s, Training: 36s. Estimated remaining time: 48h 22m 21s. Estimated total time: 59h 12m 4s. Time estimates for 10 more iterations: 11m 50s, 100 more iterations: 1h 58m 24s, 500 more iterations: 9h 52m 0s. [2025-11-13 08:59:24,947][__main__][INFO] - Starting iteration 670. [2025-11-13 08:59:25,424][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 66 and human policies 1. 
[2025-11-13 08:59:25,424][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 08:59:46,191][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 08:59:57,611][__main__][INFO] - Number of regex retries in iteration 670: 1 [2025-11-13 08:59:57,612][__main__][INFO] - agents played in iteration 670 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 08:59:58,428][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:58,451][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:58,477][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:58,501][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 08:59:58,501][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 08:59:58,502][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 08:59:59,424][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 08:59:59,889][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:00:00,404][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:00:00,908][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:00:01,418][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:00:01,923][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:00:02,431][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:00:02,942][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:00:03,445][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:00:03,949][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:00:04,448][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:00:04,954][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:00:05,459][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:00:05,956][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:00:06,454][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:00:06,959][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:00:07,470][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:00:07,975][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:00:08,484][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:00:08,989][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:00:09,493][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:00:10,000][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:00:10,506][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:00:11,012][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:00:11,518][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:00:12,028][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:00:12,534][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:00:13,039][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:00:13,543][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:00:14,051][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:00:14,559][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:00:15,062][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:00:15,567][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:00:16,072][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:00:16,568][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:00:17,064][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:00:17,568][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:00:18,072][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:00:18,576][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:00:19,080][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:00:19,576][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:00:20,082][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:00:20,588][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:00:21,105][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:00:21,605][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:00:22,106][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:00:22,607][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:00:23,103][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:00:23,601][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:00:24,103][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:00:24,627][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:00:25,132][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:00:25,640][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:00:26,146][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:00:26,642][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:00:27,142][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:00:27,644][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:00:28,148][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:00:28,649][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:00:29,156][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:00:29,657][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:00:30,159][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:00:30,656][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:00:31,159][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:00:31,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9933 tokens. [2025-11-13 09:00:32,588][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.12%, ΔTime: 00:00:33 [2025-11-13 09:00:33,368][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:00:33,370][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:00:33,372][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:00:35,213][__main__][INFO] - Iteration 671 took 1m 9s (46.12% Gen, 51.24% Train). Generation: 32s, Training: 35s. Estimated remaining time: 47h 18m 35s. Estimated total time: 58h 9m 28s. Time estimates for 10 more iterations: 11m 37s, 100 more iterations: 1h 56m 18s, 500 more iterations: 9h 41m 34s. [2025-11-13 09:00:35,215][__main__][INFO] - Starting iteration 671. [2025-11-13 09:00:35,698][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 67 and human policies 1. 
[2025-11-13 09:00:35,699][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:00:58,869][__main__][INFO] - Number of regex retries in iteration 671: 0 [2025-11-13 09:00:58,869][__main__][INFO] - agents played in iteration 671 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:00:59,698][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:59,724][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:59,749][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:59,772][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:00:59,773][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:00:59,774][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:01:00,681][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:01:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:01:01,657][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:01:02,162][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:01:02,672][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:01:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:01:03,679][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:01:04,181][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:01:04,687][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:01:05,192][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:01:05,706][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:01:06,218][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:01:06,726][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:01:07,230][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:01:07,737][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:01:08,247][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:01:08,747][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:01:09,254][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:01:09,761][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:01:10,267][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:01:10,775][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:01:11,286][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:01:11,798][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:01:12,309][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:01:12,819][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:01:13,324][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:01:13,821][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:01:14,324][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:01:14,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:01:15,335][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:01:15,839][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:01:16,355][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:01:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:01:17,361][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:01:17,869][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:01:18,369][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:01:18,877][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:01:19,379][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:01:19,882][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:01:20,386][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:01:20,888][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:01:21,396][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:01:21,901][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:01:22,400][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:01:22,920][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:01:23,427][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:01:23,941][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:01:24,445][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:01:24,950][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:01:25,460][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:01:25,966][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:01:26,473][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:01:26,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:01:27,479][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:01:27,984][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:01:28,484][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:01:28,981][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:01:29,483][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:01:29,983][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:01:30,483][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:01:30,983][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:01:31,487][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:01:31,990][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:01:32,496][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:01:33,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10131 tokens. [2025-11-13 09:01:33,879][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.35%, ΔTime: 00:00:33 [2025-11-13 09:01:34,650][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:01:34,652][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:01:34,653][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:01:35,585][__main__][INFO] - Iteration 672 took 59s (38.69% Gen, 59.75% Train). Generation: 23s, Training: 35s. Estimated remaining time: 39h 2m 29s. Estimated total time: 49h 54m 23s. Time estimates for 10 more iterations: 9m 58s, 100 more iterations: 1h 39m 48s, 500 more iterations: 8h 19m 3s. [2025-11-13 09:01:35,588][__main__][INFO] - Starting iteration 672. [2025-11-13 09:01:36,111][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 67 and human policies 1. 
[2025-11-13 09:01:36,111][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:02:07,381][__main__][INFO] - Number of regex retries in iteration 672: 0 [2025-11-13 09:02:07,383][__main__][INFO] - agents played in iteration 672 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:02:08,275][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:08,313][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:08,343][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:08,369][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:02:08,370][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:02:08,371][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:02:09,297][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:02:09,758][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:02:10,268][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:02:10,781][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:02:11,286][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:02:11,787][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:02:12,295][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:02:12,804][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:02:13,313][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:02:13,822][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:02:14,325][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:02:14,828][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:02:15,329][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:02:15,828][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:02:16,337][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:02:16,843][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:02:17,346][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:02:17,847][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:02:18,349][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:02:18,850][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:02:19,352][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:02:19,855][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:02:20,359][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:02:20,866][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:02:21,370][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:02:21,877][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:02:22,378][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:02:22,883][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:02:23,378][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:02:23,881][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:02:24,377][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:02:24,870][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:02:25,371][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:02:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:02:26,367][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:02:26,871][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:02:27,377][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:02:27,878][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:02:28,375][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:02:28,882][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:02:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:02:29,884][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:02:30,390][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:02:30,899][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:02:31,403][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:02:31,902][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:02:32,410][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:02:32,920][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:02:33,425][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:02:33,921][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:02:34,419][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:02:34,925][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:02:35,422][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:02:35,919][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:02:36,420][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:02:36,925][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:02:37,426][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:02:37,925][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:02:38,425][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:02:38,921][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:02:39,420][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:02:39,921][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:02:40,425][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:02:40,924][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:02:41,423][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9873 tokens. [2025-11-13 09:02:42,329][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.95%, Current % of VRAM taken: 58.20%, Block Peak % of device VRAM: 62.14%, ΔTime: 00:00:33 [2025-11-13 09:02:42,964][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:02:42,966][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:02:42,968][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:02:43,804][__main__][INFO] - Iteration 673 took 1m 7s (46.19% Gen, 52.57% Train). Generation: 31s, Training: 35s. Estimated remaining time: 45h 31m 40s. Estimated total time: 56h 24m 42s. Time estimates for 10 more iterations: 11m 16s, 100 more iterations: 1h 52m 49s, 500 more iterations: 9h 24m 7s. [2025-11-13 09:02:43,806][__main__][INFO] - Starting iteration 673. [2025-11-13 09:02:44,304][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 67 and human policies 1. 
[2025-11-13 09:02:44,305][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:03:01,073][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:03:12,585][__main__][INFO] - Number of regex retries in iteration 673: 1 [2025-11-13 09:03:12,586][__main__][INFO] - agents played in iteration 673 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:03:13,429][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:13,455][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:13,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:13,504][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:03:13,504][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:03:13,506][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:03:14,388][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:03:14,850][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:03:15,368][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:03:15,875][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:03:16,382][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:03:16,890][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:03:17,400][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:03:17,906][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:03:18,417][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:03:18,928][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:03:19,438][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:03:19,944][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:03:20,451][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:03:20,956][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:03:21,461][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:03:21,972][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:03:22,485][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:03:22,988][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:03:23,498][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:03:24,007][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:03:24,509][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:03:25,009][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:03:25,503][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:03:26,003][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:03:26,504][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:03:27,010][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:03:27,526][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:03:28,035][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:03:28,551][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:03:29,051][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:03:29,557][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:03:30,063][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:03:30,559][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:03:31,065][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:03:31,566][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:03:32,070][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:03:32,579][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:03:33,082][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:03:33,583][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:03:34,087][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:03:34,597][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:03:35,100][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:03:35,601][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:03:36,103][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:03:36,601][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:03:37,104][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:03:37,608][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:03:38,109][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:03:38,603][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:03:39,108][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:03:39,608][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:03:40,111][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:03:40,616][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:03:41,118][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:03:41,619][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:03:42,124][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:03:42,623][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:03:43,137][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:03:43,632][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:03:44,135][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:03:44,636][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:03:45,131][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:03:45,630][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:03:46,131][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:03:46,633][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9986 tokens. [2025-11-13 09:03:47,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.42%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 09:03:48,214][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:03:48,216][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:03:48,218][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:03:49,144][__main__][INFO] - Iteration 674 took 1m 4s (43.61% Gen, 54.95% Train). Generation: 28s, Training: 35s. Estimated remaining time: 43h 7m 55s. Estimated total time: 54h 2m 2s. Time estimates for 10 more iterations: 10m 48s, 100 more iterations: 1h 48m 4s, 500 more iterations: 9h 0m 20s. [2025-11-13 09:03:49,146][__main__][INFO] - Starting iteration 674. [2025-11-13 09:03:49,656][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 67 and human policies 1. 
[2025-11-13 09:03:49,657][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:04:10,842][__main__][INFO] - Number of regex retries in iteration 674: 0 [2025-11-13 09:04:10,842][__main__][INFO] - agents played in iteration 674 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:04:11,624][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:11,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:11,670][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:11,692][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:04:11,693][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:04:11,693][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:04:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:04:12,996][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:04:13,513][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:04:14,027][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:04:14,529][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:04:15,027][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:04:15,536][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:04:16,053][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:04:16,563][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:04:17,072][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:04:17,582][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:04:18,091][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:04:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:04:19,107][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:04:19,612][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:04:20,121][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:04:20,634][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:04:21,136][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:04:21,648][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:04:22,151][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:04:22,662][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:04:23,165][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:04:23,671][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:04:24,180][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:04:24,683][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:04:25,189][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:04:25,692][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:04:26,195][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:04:26,702][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:04:27,209][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:04:27,713][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:04:28,218][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:04:28,723][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:04:29,230][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:04:29,727][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:04:30,234][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:04:30,745][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:04:31,246][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:04:31,757][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:04:32,265][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:04:32,767][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:04:33,269][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:04:33,774][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:04:34,284][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:04:34,779][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:04:35,281][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:04:35,785][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:04:36,290][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:04:36,792][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:04:37,297][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:04:37,804][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:04:38,309][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:04:38,814][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:04:39,317][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:04:39,824][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:04:40,326][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:04:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:04:41,335][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:04:41,835][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:04:42,338][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:04:42,841][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:04:43,344][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:04:43,853][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:04:44,355][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:04:44,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9994 tokens. [2025-11-13 09:04:45,771][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 09:04:46,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:04:46,523][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:04:46,525][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:04:47,501][__main__][INFO] - Iteration 675 took 57s (36.62% Gen, 61.69% Train). Generation: 21s, Training: 35s. Estimated remaining time: 37h 17m 11s. Estimated total time: 48h 12m 16s. Time estimates for 10 more iterations: 9m 38s, 100 more iterations: 1h 36m 24s, 500 more iterations: 8h 2m 2s. [2025-11-13 09:04:47,503][__main__][INFO] - Starting iteration 675. [2025-11-13 09:04:47,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 67 and human policies 1. 
[2025-11-13 09:04:47,976][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:05:21,722][__main__][INFO] - Number of regex retries in iteration 675: 0 [2025-11-13 09:05:21,723][__main__][INFO] - agents played in iteration 675 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:05:22,578][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:22,607][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:22,635][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:22,659][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:05:22,660][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:05:22,661][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:05:23,572][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:05:24,050][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:05:24,560][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:05:25,069][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:05:25,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:05:26,088][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:05:26,600][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:05:27,105][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:05:27,620][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:05:28,127][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:05:28,628][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:05:29,132][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:05:29,642][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:05:30,150][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:05:30,667][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:05:31,176][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:05:31,675][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:05:32,184][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:05:32,694][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:05:33,204][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:05:33,708][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:05:34,208][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:05:34,708][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:05:35,204][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:05:35,709][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:05:36,212][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:05:36,711][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:05:37,223][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:05:37,727][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:05:38,230][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:05:38,737][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:05:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:05:39,751][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:05:40,252][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:05:40,758][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:05:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:05:41,755][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:05:42,260][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:05:42,763][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:05:43,266][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:05:43,781][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:05:44,278][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:05:44,783][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:05:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:05:45,790][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:05:46,295][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:05:46,798][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:05:47,313][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:05:47,823][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:05:48,328][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:05:48,831][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:05:49,334][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:05:49,832][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:05:50,339][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:05:50,840][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:05:51,342][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:05:51,856][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:05:52,360][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:05:52,873][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:05:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:05:53,866][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:05:54,373][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:05:54,867][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:05:55,360][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:05:55,870][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9891 tokens. [2025-11-13 09:05:56,680][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.02%, Current % of VRAM taken: 58.26%, Block Peak % of device VRAM: 62.08%, ΔTime: 00:00:33 [2025-11-13 09:05:57,431][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:05:57,433][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:05:57,434][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:05:58,349][__main__][INFO] - Iteration 676 took 1m 10s (47.95% Gen, 50.75% Train). Generation: 33s, Training: 35s. Estimated remaining time: 47h 42m 25s. Estimated total time: 58h 38m 42s. Time estimates for 10 more iterations: 11m 43s, 100 more iterations: 1h 57m 17s, 500 more iterations: 9h 46m 27s. [2025-11-13 09:05:58,351][__main__][INFO] - Starting iteration 676. [2025-11-13 09:05:58,840][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 67 and human policies 1. 
[2025-11-13 09:05:58,840][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:06:30,454][__main__][INFO] - Number of regex retries in iteration 676: 0 [2025-11-13 09:06:30,455][__main__][INFO] - agents played in iteration 676 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:06:31,299][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:31,322][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:31,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:31,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:06:31,367][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:06:31,368][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:06:32,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:06:32,740][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:06:33,252][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:06:33,756][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:06:34,267][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:06:34,770][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:06:35,288][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:06:35,796][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:06:36,304][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:06:36,809][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:06:37,309][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:06:37,816][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:06:38,330][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:06:38,834][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:06:39,340][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:06:39,844][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:06:40,348][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:06:40,858][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:06:41,367][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:06:41,885][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:06:42,395][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:06:42,902][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:06:43,406][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:06:43,936][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:06:44,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:06:44,947][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:06:45,448][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:06:45,948][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:06:46,454][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:06:46,960][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:06:47,462][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:06:47,969][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:06:48,470][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:06:48,975][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:06:49,477][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:06:49,979][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:06:50,482][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:06:50,998][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:06:51,496][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:06:51,994][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:06:52,499][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:06:53,004][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:06:53,513][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:06:54,020][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:06:54,527][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:06:55,038][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:06:55,547][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:06:56,055][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:06:56,561][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:06:57,063][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:06:57,566][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:06:58,065][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:06:58,560][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:06:59,059][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:06:59,560][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:07:00,055][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:07:00,549][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:07:01,041][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:07:01,544][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:07:02,044][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:07:02,543][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:07:03,037][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:07:03,535][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:07:04,027][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:07:04,529][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10005 tokens. [2025-11-13 09:07:05,409][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.23%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.31%, ΔTime: 00:00:33 [2025-11-13 09:07:06,129][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:07:06,130][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:07:06,132][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:07:07,199][__main__][INFO] - Iteration 677 took 1m 8s (46.25% Gen, 52.19% Train). Generation: 31s, Training: 35s. Estimated remaining time: 46h 0m 33s. Estimated total time: 56h 57m 59s. Time estimates for 10 more iterations: 11m 23s, 100 more iterations: 1h 53m 55s, 500 more iterations: 9h 29m 39s. [2025-11-13 09:07:07,201][__main__][INFO] - Starting iteration 677. [2025-11-13 09:07:07,687][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 67 and human policies 1. 
[2025-11-13 09:07:07,688][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:07:36,673][__main__][INFO] - Number of regex retries in iteration 677: 0 [2025-11-13 09:07:36,674][__main__][INFO] - agents played in iteration 677 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:07:37,551][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:37,576][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:37,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:37,626][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:07:37,627][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:07:37,627][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:07:38,549][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:07:39,006][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:07:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:07:40,018][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:07:40,515][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:07:41,026][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:07:41,523][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:07:42,025][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:07:42,531][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:07:43,042][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:07:43,550][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:07:44,063][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:07:44,568][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:07:45,082][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:07:45,583][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:07:46,091][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:07:46,598][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:07:47,096][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:07:47,604][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:07:48,110][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:07:48,610][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:07:49,105][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:07:49,603][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:07:50,117][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:07:50,618][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:07:51,115][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:07:51,637][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:07:52,141][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:07:52,639][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:07:53,140][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:07:53,640][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:07:54,144][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:07:54,645][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:07:55,147][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:07:55,655][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:07:56,157][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:07:56,661][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:07:57,161][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:07:57,663][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:07:58,164][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:07:58,669][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:07:59,165][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:07:59,666][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:08:00,161][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:08:00,661][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:08:01,160][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:08:01,665][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:08:02,169][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:08:02,678][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:08:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:08:03,670][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:08:04,169][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:08:04,668][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:08:05,171][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:08:05,664][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:08:06,158][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:08:06,651][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:08:07,154][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:08:07,656][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:08:08,157][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:08:08,667][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:08:09,169][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:08:09,672][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:08:10,174][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:08:10,676][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9804 tokens. [2025-11-13 09:08:11,546][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.39%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:33 [2025-11-13 09:08:12,258][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:08:12,260][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:08:12,262][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:08:13,136][__main__][INFO] - Iteration 678 took 1m 5s (44.29% Gen, 54.37% Train). Generation: 28s, Training: 35s. Estimated remaining time: 43h 33m 57s. Estimated total time: 54h 32m 28s. Time estimates for 10 more iterations: 10m 54s, 100 more iterations: 1h 49m 4s, 500 more iterations: 9h 5m 24s. [2025-11-13 09:08:13,138][__main__][INFO] - Starting iteration 678. [2025-11-13 09:08:13,626][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 67 and human policies 1. 
[2025-11-13 09:08:13,627][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:08:43,514][__main__][INFO] - Number of regex retries in iteration 678: 0 [2025-11-13 09:08:43,515][__main__][INFO] - agents played in iteration 678 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:08:44,367][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:44,396][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:44,421][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:44,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:08:44,444][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:08:44,445][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:08:45,342][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:08:45,802][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:08:46,310][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:08:46,816][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:08:47,317][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:08:47,843][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:08:48,349][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:08:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:08:49,364][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:08:49,872][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:08:50,380][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:08:50,885][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:08:51,398][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:08:51,917][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:08:52,424][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:08:52,932][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:08:53,441][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:08:53,947][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:08:54,442][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:08:54,940][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:08:55,446][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:08:55,940][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:08:56,444][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:08:56,948][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:08:57,448][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:08:57,947][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:08:58,451][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:08:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:08:59,453][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:08:59,954][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:09:00,448][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:09:00,940][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:09:01,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:09:01,942][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:09:02,439][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:09:02,935][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:09:03,444][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:09:03,942][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:09:04,443][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:09:04,939][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:09:05,434][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:09:05,946][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:09:06,451][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:09:06,951][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:09:07,454][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:09:07,958][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:09:08,458][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:09:08,962][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:09:09,463][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:09:09,959][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:09:10,459][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:09:10,955][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:09:11,454][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:09:11,949][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:09:12,446][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:09:12,948][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:09:13,449][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:09:13,961][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:09:14,466][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:09:14,961][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:09:15,456][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:09:15,957][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:09:16,456][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:09:16,959][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:09:17,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9780 tokens. [2025-11-13 09:09:18,275][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.89%, Current % of VRAM taken: 58.14%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:32 [2025-11-13 09:09:19,014][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:09:19,016][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:09:19,017][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:09:19,922][__main__][INFO] - Iteration 679 took 1m 6s (45.08% Gen, 53.55% Train). Generation: 29s, Training: 35s. Estimated remaining time: 44h 15m 11s. Estimated total time: 55h 14m 49s. Time estimates for 10 more iterations: 11m 2s, 100 more iterations: 1h 50m 29s, 500 more iterations: 9h 12m 28s. [2025-11-13 09:09:19,924][__main__][INFO] - Starting iteration 679. [2025-11-13 09:09:20,399][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 67 and human policies 1. 
[2025-11-13 09:09:20,400][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:09:46,524][__main__][INFO] - Number of regex retries in iteration 679: 0 [2025-11-13 09:09:46,526][__main__][INFO] - agents played in iteration 679 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:09:47,427][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:47,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:47,500][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:47,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:09:47,529][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:09:47,530][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:09:48,488][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:09:48,953][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:09:49,467][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:09:49,974][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:09:50,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:09:50,987][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:09:51,497][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:09:52,012][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:09:52,521][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:09:53,024][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:09:53,533][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:09:54,040][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:09:54,544][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:09:55,050][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:09:55,561][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:09:56,070][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:09:56,577][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:09:57,081][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:09:57,586][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:09:58,086][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:09:58,599][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:09:59,100][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:09:59,603][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:10:00,120][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:10:00,629][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:10:01,134][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:10:01,635][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:10:02,139][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:10:02,651][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:10:03,148][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:10:03,652][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:10:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:10:04,650][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:10:05,163][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:10:05,666][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:10:06,166][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:10:06,667][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:10:07,167][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:10:07,674][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:10:08,176][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:10:08,676][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:10:09,206][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:10:09,700][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:10:10,195][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:10:10,693][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:10:11,195][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:10:11,691][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:10:12,187][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:10:12,681][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:10:13,177][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:10:13,680][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:10:14,173][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:10:14,677][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:10:15,181][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:10:15,678][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:10:16,184][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:10:16,684][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:10:17,184][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:10:17,686][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:10:18,182][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:10:18,680][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:10:19,178][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:10:19,682][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:10:20,178][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:10:20,675][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9822 tokens. [2025-11-13 09:10:21,563][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.90%, Current % of VRAM taken: 58.15%, Block Peak % of device VRAM: 62.12%, ΔTime: 00:00:33 [2025-11-13 09:10:22,216][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:10:22,218][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:10:22,220][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:10:23,053][__main__][INFO] - Iteration 680 took 1m 2s (41.70% Gen, 56.97% Train). Generation: 26s, Training: 35s. Estimated remaining time: 41h 12m 3s. Estimated total time: 52h 12m 44s. Time estimates for 10 more iterations: 10m 26s, 100 more iterations: 1h 44m 25s, 500 more iterations: 8h 42m 7s. [2025-11-13 09:10:23,055][__main__][INFO] - Starting iteration 680. [2025-11-13 09:10:23,546][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 67 and human policies 1. 
[2025-11-13 09:10:23,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:10:56,610][__main__][INFO] - Number of regex retries in iteration 680: 0 [2025-11-13 09:10:56,611][__main__][INFO] - agents played in iteration 680 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:10:57,470][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:57,493][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:57,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:57,542][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:10:57,543][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:10:57,544][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:10:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:10:58,921][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:10:59,425][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:10:59,936][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:11:00,459][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:11:00,959][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:11:01,465][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:11:01,965][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:11:02,472][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:11:02,984][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:11:03,493][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:11:04,005][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:11:04,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:11:05,020][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:11:05,526][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:11:06,031][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:11:06,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:11:07,066][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:11:07,574][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:11:08,080][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:11:08,583][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:11:09,091][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:11:09,598][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:11:10,110][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:11:10,617][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:11:11,120][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:11:11,621][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:11:12,120][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:11:12,616][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:11:13,114][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:11:13,622][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:11:14,123][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:11:14,624][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:11:15,128][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:11:15,622][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:11:16,124][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:11:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:11:17,121][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:11:17,633][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:11:18,136][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:11:18,643][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:11:19,139][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:11:19,638][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:11:20,144][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:11:20,650][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:11:21,156][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:11:21,660][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:11:22,163][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:11:22,686][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:11:23,195][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:11:23,702][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:11:24,207][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:11:24,710][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:11:25,215][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:11:25,740][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:11:26,248][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:11:26,754][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:11:27,250][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:11:27,754][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:11:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:11:28,761][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:11:29,273][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:11:29,768][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:11:30,272][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:11:30,773][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10001 tokens. [2025-11-13 09:11:31,628][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.06%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 09:11:32,379][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:11:32,381][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:11:32,383][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:11:34,171][__main__][INFO] - Iteration 681 took 1m 10s (46.81% Gen, 50.65% Train). Generation: 33s, Training: 35s. Estimated remaining time: 47h 49m 26s. Estimated total time: 58h 51m 18s. Time estimates for 10 more iterations: 11m 46s, 100 more iterations: 1h 57m 42s, 500 more iterations: 9h 48m 33s. [2025-11-13 09:11:34,173][__main__][INFO] - Starting iteration 681. [2025-11-13 09:11:34,644][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 68 and human policies 1. 
[2025-11-13 09:11:34,644][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:12:00,998][__main__][INFO] - Number of regex retries in iteration 681: 0 [2025-11-13 09:12:00,999][__main__][INFO] - agents played in iteration 681 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:12:02,745][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:02,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:02,796][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:02,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:12:02,821][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:12:02,822][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:12:03,778][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:12:04,312][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:12:04,829][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:12:05,338][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:12:05,846][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:12:06,363][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:12:06,869][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:12:07,384][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:12:07,897][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:12:08,403][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:12:08,914][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:12:09,417][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:12:09,919][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:12:10,426][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:12:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:12:11,439][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:12:11,945][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:12:12,452][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:12:12,973][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:12:13,477][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:12:13,990][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:12:14,493][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:12:14,993][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:12:15,501][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:12:16,006][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:12:16,509][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:12:17,005][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:12:17,507][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:12:18,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:12:18,517][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:12:19,026][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:12:19,529][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:12:20,033][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:12:20,531][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:12:21,029][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:12:21,530][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:12:22,030][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:12:22,525][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:12:23,020][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:12:23,520][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:12:24,022][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:12:24,533][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:12:25,038][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:12:25,539][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:12:26,043][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:12:26,544][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:12:27,045][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:12:27,550][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:12:28,051][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:12:28,556][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:12:29,061][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:12:29,562][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:12:30,066][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:12:30,568][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:12:31,089][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:12:31,597][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:12:32,099][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:12:32,602][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:12:33,101][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:12:33,604][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:12:34,103][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:12:34,605][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:12:35,114][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:12:35,616][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:12:36,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9953 tokens. [2025-11-13 09:12:36,996][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 09:12:37,654][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:12:37,656][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:12:37,657][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:12:38,542][__main__][INFO] - Iteration 682 took 1m 3s (41.24% Gen, 57.37% Train). Generation: 26s, Training: 36s. Estimated remaining time: 42h 11m 57s. Estimated total time: 53h 14m 54s. Time estimates for 10 more iterations: 10m 38s, 100 more iterations: 1h 46m 29s, 500 more iterations: 8h 52m 29s. [2025-11-13 09:12:38,544][__main__][INFO] - Starting iteration 682. [2025-11-13 09:12:39,040][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 68 and human policies 1. 
[2025-11-13 09:12:39,041][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:13:14,831][__main__][INFO] - Number of regex retries in iteration 682: 0 [2025-11-13 09:13:14,832][__main__][INFO] - agents played in iteration 682 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:13:15,683][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:15,712][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:15,739][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:15,762][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:13:15,763][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:13:15,764][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:13:16,653][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:13:17,116][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:13:17,620][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:13:18,121][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:13:18,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:13:19,141][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:13:19,647][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:13:20,159][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:13:20,658][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:13:21,166][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:13:21,673][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:13:22,183][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:13:22,688][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:13:23,192][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:13:23,696][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:13:24,201][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:13:24,717][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:13:25,222][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:13:25,731][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:13:26,241][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:13:26,745][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:13:27,254][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:13:27,757][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:13:28,255][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:13:28,757][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:13:29,258][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:13:29,759][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:13:30,271][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:13:30,771][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:13:31,285][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:13:31,785][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:13:32,286][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:13:32,791][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:13:33,295][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:13:33,805][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:13:34,306][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:13:34,808][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:13:35,311][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:13:35,817][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:13:36,320][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:13:36,823][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:13:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:13:37,827][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:13:38,328][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:13:38,830][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:13:39,336][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:13:39,838][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:13:40,340][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:13:40,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:13:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:13:41,864][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:13:42,365][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:13:42,865][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:13:43,367][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:13:43,868][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:13:44,373][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:13:44,873][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:13:45,384][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:13:45,895][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:13:46,406][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:13:46,904][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:13:47,406][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:13:47,910][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:13:48,420][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:13:48,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10034 tokens. [2025-11-13 09:13:49,805][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 09:13:50,544][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:13:50,545][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:13:50,547][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:13:51,459][__main__][INFO] - Iteration 683 took 1m 12s (49.42% Gen, 49.32% Train). Generation: 35s, Training: 35s. Estimated remaining time: 49h 16m 49s. Estimated total time: 60h 20m 59s. Time estimates for 10 more iterations: 12m 4s, 100 more iterations: 2h 0m 41s, 500 more iterations: 10h 3m 29s. [2025-11-13 09:13:51,461][__main__][INFO] - Starting iteration 683. [2025-11-13 09:13:51,959][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 68 and human policies 1. 
[2025-11-13 09:13:51,959][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:14:12,252][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:14:19,271][__main__][INFO] - Number of regex retries in iteration 683: 1 [2025-11-13 09:14:19,271][__main__][INFO] - agents played in iteration 683 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:14:20,121][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:20,144][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:20,167][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:20,190][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:14:20,190][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:14:20,192][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:14:21,055][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:14:21,520][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:14:22,032][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:14:22,540][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:14:23,054][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:14:23,556][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:14:24,063][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:14:24,568][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:14:25,073][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:14:25,584][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:14:26,087][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:14:26,608][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:14:27,112][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:14:27,618][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:14:28,126][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:14:28,623][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:14:29,126][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:14:29,627][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:14:30,136][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:14:30,634][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:14:31,135][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:14:31,634][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:14:32,136][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:14:32,637][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:14:33,141][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:14:33,640][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:14:34,144][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:14:34,651][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:14:35,155][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:14:35,656][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:14:36,159][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:14:36,661][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:14:37,179][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:14:37,682][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:14:38,188][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:14:38,700][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:14:39,200][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:14:39,709][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:14:40,207][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:14:40,709][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:14:41,208][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:14:41,712][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:14:42,220][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:14:42,724][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:14:43,225][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:14:43,726][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:14:44,227][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:14:44,727][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:14:45,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:14:45,721][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:14:46,224][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:14:46,724][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:14:47,228][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:14:47,728][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:14:48,225][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:14:48,720][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:14:49,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:14:49,718][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:14:50,217][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:14:50,714][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:14:51,214][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:14:51,717][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:14:52,219][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:14:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:14:53,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9913 tokens. [2025-11-13 09:14:54,070][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.95%, Current % of VRAM taken: 58.19%, Block Peak % of device VRAM: 62.12%, ΔTime: 00:00:33 [2025-11-13 09:14:54,813][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:14:54,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:14:54,816][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:14:55,952][__main__][INFO] - Iteration 684 took 1m 3s (42.68% Gen, 55.54% Train). Generation: 27s, Training: 35s. Estimated remaining time: 42h 14m 26s. Estimated total time: 53h 19m 40s. Time estimates for 10 more iterations: 10m 39s, 100 more iterations: 1h 46m 39s, 500 more iterations: 8h 53m 16s. [2025-11-13 09:14:55,954][__main__][INFO] - Starting iteration 684. [2025-11-13 09:14:56,444][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 68 and human policies 1. 
[2025-11-13 09:14:56,445][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:15:30,428][__main__][INFO] - Number of regex retries in iteration 684: 0 [2025-11-13 09:15:30,429][__main__][INFO] - agents played in iteration 684 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:15:31,325][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:31,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:31,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:31,420][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:15:31,421][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:15:31,421][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:15:32,324][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:15:32,787][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:15:33,294][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:15:33,802][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:15:34,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:15:34,814][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:15:35,318][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:15:35,826][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:15:36,337][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:15:36,934][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:15:37,444][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:15:37,948][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:15:38,455][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:15:38,959][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:15:39,459][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:15:39,964][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:15:40,465][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:15:40,969][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:15:41,475][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:15:41,978][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:15:42,477][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:15:42,985][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:15:43,488][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:15:43,984][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:15:44,479][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:15:44,979][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:15:45,488][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:15:45,993][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:15:46,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:15:46,999][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:15:47,499][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:15:48,010][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:15:48,509][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:15:49,013][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:15:49,529][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:15:50,026][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:15:50,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:15:51,020][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:15:51,521][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:15:52,031][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:15:52,525][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:15:53,026][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:15:53,525][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:15:54,024][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:15:54,533][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:15:55,033][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:15:55,534][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:15:56,033][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:15:56,525][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:15:57,024][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:15:57,524][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:15:58,028][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:15:58,539][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:15:59,040][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:15:59,542][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:16:00,045][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:16:00,548][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:16:01,050][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:16:01,553][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:16:02,061][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:16:02,563][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:16:03,062][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:16:03,562][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:16:04,062][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:16:04,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9902 tokens. [2025-11-13 09:16:05,435][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.01%, Current % of VRAM taken: 58.26%, Block Peak % of device VRAM: 62.07%, ΔTime: 00:00:33 [2025-11-13 09:16:06,190][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:16:06,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:16:06,193][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:16:07,157][__main__][INFO] - Iteration 685 took 1m 10s (48.06% Gen, 50.58% Train). Generation: 33s, Training: 35s. Estimated remaining time: 47h 49m 15s. Estimated total time: 58h 55m 40s. Time estimates for 10 more iterations: 11m 47s, 100 more iterations: 1h 57m 51s, 500 more iterations: 9h 49m 16s. [2025-11-13 09:16:07,159][__main__][INFO] - Starting iteration 685. [2025-11-13 09:16:07,640][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 68 and human policies 1. 
[2025-11-13 09:16:07,641][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:16:34,481][__main__][INFO] - Number of regex retries in iteration 685: 0 [2025-11-13 09:16:34,481][__main__][INFO] - agents played in iteration 685 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:16:35,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:35,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:35,379][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:35,402][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.26%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:16:35,402][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:16:35,403][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:16:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:16:36,746][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:16:37,258][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:16:37,772][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:16:38,285][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:16:38,794][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:16:39,299][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:16:39,805][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:16:40,305][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:16:40,811][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:16:41,318][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:16:41,834][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:16:42,340][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:16:42,849][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:16:43,351][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:16:43,854][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:16:44,361][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:16:44,868][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:16:45,370][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:16:45,872][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:16:46,375][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:16:46,879][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:16:47,385][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:16:47,890][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:16:48,393][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:16:48,907][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:16:49,416][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:16:49,920][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:16:50,441][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:16:50,952][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:16:51,461][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:16:51,965][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:16:52,469][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:16:52,984][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:16:53,490][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:16:54,003][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:16:54,509][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:16:55,018][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:16:55,537][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:16:56,047][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:16:56,550][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:16:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:16:57,548][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:16:58,056][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:16:58,556][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:16:59,056][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:16:59,550][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:17:00,049][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:17:00,553][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:17:01,055][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:17:01,556][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:17:02,064][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:17:02,564][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:17:03,066][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:17:03,569][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:17:04,068][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:17:04,571][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:17:05,072][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:17:05,571][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:17:06,070][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:17:06,575][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:17:07,076][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:17:07,581][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:17:08,081][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:17:08,580][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9970 tokens. [2025-11-13 09:17:09,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.97%, Current % of VRAM taken: 58.22%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 09:17:10,167][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:17:10,168][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:17:10,170][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:17:11,124][__main__][INFO] - Iteration 686 took 1m 3s (42.28% Gen, 56.22% Train). Generation: 26s, Training: 35s. Estimated remaining time: 41h 46m 46s. Estimated total time: 52h 54m 16s. Time estimates for 10 more iterations: 10m 34s, 100 more iterations: 1h 45m 48s, 500 more iterations: 8h 49m 2s. [2025-11-13 09:17:11,126][__main__][INFO] - Starting iteration 686. [2025-11-13 09:17:11,612][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 68 and human policies 1. 
[2025-11-13 09:17:11,612][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:17:39,489][__main__][INFO] - Number of regex retries in iteration 686: 0 [2025-11-13 09:17:39,491][__main__][INFO] - agents played in iteration 686 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:17:40,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:40,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:40,492][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:40,515][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:17:40,516][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:17:40,517][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:17:41,407][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:17:41,872][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:17:42,384][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:17:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:17:43,403][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:17:43,909][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:17:44,419][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:17:44,929][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:17:45,455][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:17:45,955][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:17:46,461][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:17:46,968][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:17:47,472][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:17:47,972][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:17:48,473][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:17:48,981][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:17:49,489][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:17:49,987][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:17:50,488][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:17:50,985][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:17:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:17:51,988][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:17:52,490][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:17:52,991][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:17:53,489][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:17:53,992][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:17:54,499][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:17:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:17:55,505][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:17:56,006][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:17:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:17:57,009][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:17:57,513][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:17:58,015][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:17:58,526][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:17:59,024][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:17:59,528][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:18:00,033][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:18:00,535][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:18:01,052][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:18:01,552][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:18:02,046][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:18:02,551][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:18:03,046][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:18:03,542][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:18:04,054][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:18:04,555][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:18:05,059][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:18:05,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:18:06,062][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:18:06,569][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:18:07,067][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:18:07,563][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:18:08,065][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:18:08,573][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:18:09,078][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:18:09,580][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:18:10,082][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:18:10,583][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:18:11,075][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:18:11,572][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:18:12,077][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:18:12,581][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:18:13,082][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:18:13,585][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9853 tokens. [2025-11-13 09:18:14,441][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.40%, ΔTime: 00:00:33 [2025-11-13 09:18:15,119][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:18:15,121][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:18:15,123][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:18:15,984][__main__][INFO] - Iteration 687 took 1m 4s (43.31% Gen, 55.35% Train). Generation: 27s, Training: 35s. Estimated remaining time: 42h 30m 3s. Estimated total time: 53h 38m 38s. Time estimates for 10 more iterations: 10m 43s, 100 more iterations: 1h 47m 17s, 500 more iterations: 8h 56m 26s. [2025-11-13 09:18:15,986][__main__][INFO] - Starting iteration 687. [2025-11-13 09:18:16,491][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 68 and human policies 1. 
[2025-11-13 09:18:16,491][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:18:47,162][__main__][INFO] - Number of regex retries in iteration 687: 0 [2025-11-13 09:18:47,163][__main__][INFO] - agents played in iteration 687 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:18:48,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:48,073][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:48,100][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:48,125][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:18:48,126][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:18:48,127][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:18:49,038][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:18:49,499][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:18:50,011][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:18:50,520][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:18:51,028][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:18:51,533][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:18:52,042][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:18:52,562][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:18:53,067][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:18:53,576][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:18:54,080][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:18:54,588][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:18:55,093][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:18:55,598][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:18:56,104][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:18:56,609][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:18:57,112][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:18:57,624][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:18:58,131][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:18:58,641][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:18:59,153][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:18:59,657][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:19:00,170][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:19:00,672][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:19:01,176][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:19:01,687][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:19:02,199][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:19:02,727][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:19:03,239][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:19:03,753][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:19:04,281][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:19:04,785][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:19:05,292][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:19:05,797][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:19:06,302][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:19:06,815][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:19:07,321][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:19:07,829][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:19:08,347][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:19:08,852][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:19:09,369][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:19:09,873][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:19:10,376][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:19:10,885][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:19:11,388][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:19:11,893][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:19:12,400][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:19:12,901][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:19:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:19:13,908][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:19:14,413][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:19:14,915][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:19:15,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:19:15,923][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:19:16,426][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:19:16,929][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:19:17,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:19:17,946][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:19:18,449][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:19:18,951][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:19:19,452][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:19:19,953][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:19:20,453][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:19:20,949][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:19:21,455][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10024 tokens. [2025-11-13 09:19:22,345][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.50%, ΔTime: 00:00:33 [2025-11-13 09:19:23,077][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:19:23,079][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:19:23,081][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:19:24,000][__main__][INFO] - Iteration 688 took 1m 7s (45.43% Gen, 53.20% Train). Generation: 30s, Training: 35s. Estimated remaining time: 45h 5m 47s. Estimated total time: 56h 15m 29s. Time estimates for 10 more iterations: 11m 15s, 100 more iterations: 1h 52m 30s, 500 more iterations: 9h 22m 34s. [2025-11-13 09:19:24,002][__main__][INFO] - Starting iteration 688. [2025-11-13 09:19:24,474][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 68 and human policies 1. 
[2025-11-13 09:19:24,474][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:19:45,485][__main__][INFO] - Number of regex retries in iteration 688: 0 [2025-11-13 09:19:45,486][__main__][INFO] - agents played in iteration 688 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:19:46,344][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.43%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:46,372][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.43%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:46,395][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.43%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:46,418][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.43%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:19:46,419][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:19:46,420][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:19:47,358][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:19:47,825][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:19:48,338][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:19:48,859][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:19:49,366][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:19:49,889][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:19:50,394][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:19:50,903][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:19:52,302][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:19:52,805][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:19:53,310][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:19:53,849][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:19:54,360][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:19:54,864][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:19:55,383][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:19:55,886][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:19:56,396][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:19:56,904][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:19:57,406][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:19:57,912][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:19:58,415][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:19:58,917][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:19:59,419][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:19:59,924][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:20:00,426][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:20:00,932][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:20:01,436][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:20:01,947][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:20:02,451][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:20:02,960][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:20:03,462][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:20:03,968][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:20:04,486][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:20:04,990][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:20:05,499][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:20:06,004][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:20:06,509][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:20:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:20:07,499][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:20:08,000][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:20:08,506][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:20:09,011][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:20:09,511][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:20:10,016][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:20:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:20:11,016][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:20:11,511][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:20:12,012][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:20:12,515][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:20:13,017][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:20:13,520][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:20:14,021][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:20:14,520][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:20:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:20:15,523][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:20:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:20:16,528][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:20:17,030][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:20:17,533][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:20:18,037][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:20:18,541][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:20:19,061][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:20:19,566][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:20:20,076][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:20:20,576][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9940 tokens. [2025-11-13 09:20:21,485][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.06%, ΔTime: 00:00:34 [2025-11-13 09:20:22,127][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:20:22,128][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:20:22,130][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:20:23,009][__main__][INFO] - Iteration 689 took 58s (35.89% Gen, 62.60% Train). Generation: 21s, Training: 36s. Estimated remaining time: 37h 36m 7s. Estimated total time: 48h 46m 49s. Time estimates for 10 more iterations: 9m 45s, 100 more iterations: 1h 37m 33s, 500 more iterations: 8h 7m 48s. [2025-11-13 09:20:23,012][__main__][INFO] - Starting iteration 689. [2025-11-13 09:20:23,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 68 and human policies 1. 
[2025-11-13 09:20:23,502][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:20:56,409][__main__][INFO] - Number of regex retries in iteration 689: 0 [2025-11-13 09:20:56,410][__main__][INFO] - agents played in iteration 689 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:20:57,221][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:57,247][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:57,272][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:57,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:20:57,296][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:20:57,297][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:20:58,175][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:20:58,640][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:20:59,149][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:20:59,664][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:21:00,168][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:21:00,670][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:21:01,177][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:21:01,685][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:21:02,204][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:21:02,710][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:21:03,216][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:21:03,731][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:21:04,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:21:04,747][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:21:05,254][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:21:05,763][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:21:06,285][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:21:06,794][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:21:07,300][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:21:07,804][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:21:08,309][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:21:08,821][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:21:09,327][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:21:09,830][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:21:10,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:21:10,834][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:21:11,340][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:21:11,845][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:21:12,350][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:21:12,853][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:21:13,354][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:21:13,860][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:21:14,364][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:21:14,867][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:21:15,378][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:21:15,876][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:21:16,379][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:21:16,885][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:21:17,382][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:21:17,895][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:21:18,398][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:21:18,901][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:21:19,408][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:21:19,910][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:21:20,412][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:21:20,921][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:21:21,427][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:21:21,938][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:21:22,441][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:21:22,943][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:21:23,444][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:21:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:21:24,449][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:21:24,950][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:21:25,448][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:21:25,957][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:21:26,459][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:21:26,960][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:21:27,461][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:21:27,967][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:21:28,477][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:21:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:21:29,479][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:21:29,986][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:21:30,486][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10053 tokens. [2025-11-13 09:21:31,422][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.15%, Current % of VRAM taken: 58.40%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 09:21:32,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:21:32,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:21:32,202][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:21:33,204][__main__][INFO] - Iteration 690 took 1m 9s (47.21% Gen, 51.35% Train). Generation: 32s, Training: 35s. Estimated remaining time: 46h 53m 21s. Estimated total time: 58h 5m 13s. Time estimates for 10 more iterations: 11m 37s, 100 more iterations: 1h 56m 10s, 500 more iterations: 9h 40m 52s. [2025-11-13 09:21:33,207][__main__][INFO] - Starting iteration 690. [2025-11-13 09:21:33,711][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 68 and human policies 1. 
[2025-11-13 09:21:33,711][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:22:04,733][__main__][INFO] - Number of regex retries in iteration 690: 0 [2025-11-13 09:22:04,734][__main__][INFO] - agents played in iteration 690 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:22:05,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:05,582][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:05,605][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:05,627][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:22:05,628][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:22:05,629][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:22:06,515][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:22:06,967][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:22:07,475][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:22:07,984][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:22:08,487][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:22:09,026][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:22:10,448][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:22:10,952][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:22:11,471][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:22:11,990][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:22:12,499][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:22:13,012][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:22:13,511][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:22:14,016][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:22:14,520][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:22:15,021][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:22:15,529][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:22:16,033][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:22:16,530][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:22:17,031][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:22:17,527][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:22:18,024][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:22:18,526][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:22:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:22:19,525][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:22:20,028][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:22:20,528][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:22:21,057][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:22:21,564][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:22:22,065][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:22:22,568][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:22:23,072][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:22:23,568][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:22:24,075][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:22:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:22:25,092][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:22:25,595][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:22:26,109][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:22:26,616][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:22:27,123][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:22:27,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:22:28,132][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:22:28,634][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:22:29,135][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:22:29,637][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:22:30,141][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:22:30,642][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:22:31,143][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:22:31,648][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:22:32,147][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:22:32,651][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:22:33,145][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:22:33,637][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:22:34,139][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:22:34,634][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:22:35,125][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:22:35,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:22:36,116][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:22:36,626][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:22:37,133][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:22:37,634][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:22:38,138][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:22:38,642][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:22:39,147][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:22:39,650][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9899 tokens. [2025-11-13 09:22:40,617][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:34 [2025-11-13 09:22:41,266][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:22:41,268][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:22:41,270][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:22:43,452][__main__][INFO] - Iteration 691 took 1m 9s (44.48% Gen, 52.39% Train). Generation: 31s, Training: 36s. Estimated remaining time: 46h 54m 4s. Estimated total time: 58h 7m 6s. Time estimates for 10 more iterations: 11m 37s, 100 more iterations: 1h 56m 14s, 500 more iterations: 9h 41m 11s. [2025-11-13 09:22:43,455][__main__][INFO] - Starting iteration 691. [2025-11-13 09:22:43,942][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 69 and human policies 1. 
[2025-11-13 09:22:43,943][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:23:19,112][__main__][INFO] - Number of regex retries in iteration 691: 0 [2025-11-13 09:23:19,112][__main__][INFO] - agents played in iteration 691 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:23:19,961][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:19,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:20,007][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:20,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:23:20,030][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:23:20,030][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:23:20,859][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:23:21,322][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:23:21,825][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:23:22,325][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:23:22,828][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:23:23,331][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:23:23,832][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:23:24,329][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:23:24,823][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:23:25,330][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:23:25,832][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:23:26,335][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:23:26,834][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:23:27,337][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:23:27,838][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:23:28,332][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:23:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:23:29,345][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:23:29,877][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:23:30,380][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:23:30,887][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:23:31,395][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:23:31,898][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:23:32,405][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:23:32,911][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:23:33,422][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:23:33,929][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:23:34,438][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:23:34,946][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:23:35,448][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:23:35,951][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:23:36,455][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:23:36,957][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:23:37,459][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:23:37,959][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:23:38,460][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:23:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:23:39,462][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:23:39,963][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:23:40,462][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:23:40,963][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:23:41,465][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:23:41,967][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:23:42,470][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:23:42,972][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:23:43,504][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:23:44,012][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:23:44,518][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:23:45,028][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:23:45,528][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:23:46,028][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:23:46,525][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:23:47,031][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:23:47,538][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:23:48,046][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:23:48,554][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:23:49,061][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:23:49,563][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:23:50,069][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:23:50,574][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:23:51,078][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:23:51,579][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:23:52,088][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:23:52,597][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:23:53,111][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10027 tokens. [2025-11-13 09:23:54,024][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.13%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 09:23:54,805][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:23:54,807][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:23:54,809][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:23:55,727][__main__][INFO] - Iteration 692 took 1m 11s (48.99% Gen, 49.73% Train). Generation: 35s, Training: 35s. Estimated remaining time: 48h 35m 5s. Estimated total time: 59h 49m 19s. Time estimates for 10 more iterations: 11m 57s, 100 more iterations: 1h 59m 38s, 500 more iterations: 9h 58m 13s. [2025-11-13 09:23:55,729][__main__][INFO] - Starting iteration 692. [2025-11-13 09:23:56,225][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 69 and human policies 1. 
[2025-11-13 09:23:56,227][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:24:16,815][mllm.models.large_language_model_local][WARNING] - Response Proposal: 20 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:24:27,004][__main__][INFO] - Number of regex retries in iteration 692: 1 [2025-11-13 09:24:27,005][__main__][INFO] - agents played in iteration 692 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:24:27,776][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:27,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:27,827][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:27,850][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:24:27,851][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:24:27,851][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:24:28,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:24:29,155][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:24:29,664][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:24:30,166][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:24:30,672][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:24:31,177][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:24:31,685][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:24:32,190][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:24:32,701][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:24:33,206][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:24:33,725][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:24:34,223][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:24:34,725][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:24:35,224][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:24:35,725][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:24:36,234][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:24:36,738][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:24:37,243][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:24:37,744][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:24:38,244][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:24:38,747][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:24:39,249][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:24:39,747][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:24:40,251][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:24:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:24:41,251][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:24:41,751][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:24:42,255][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:24:42,759][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:24:43,260][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:24:43,762][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:24:44,263][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:24:44,763][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:24:45,269][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:24:45,771][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:24:46,271][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:24:46,774][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:24:47,277][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:24:47,780][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:24:48,284][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:24:48,788][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:24:49,290][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:24:49,791][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:24:50,299][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:24:50,815][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:24:51,319][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:24:51,825][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:24:52,339][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:24:52,845][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:24:53,352][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:24:53,854][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:24:54,354][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:24:54,863][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:24:55,366][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:24:55,870][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:24:56,372][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:24:56,882][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:24:57,385][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:24:57,889][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:24:58,494][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:24:59,001][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:24:59,531][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:25:00,051][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:25:00,556][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:25:01,065][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10046 tokens. [2025-11-13 09:25:02,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.20%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 09:25:02,696][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:25:02,698][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:25:02,700][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:25:03,520][__main__][INFO] - Iteration 693 took 1m 7s (45.73% Gen, 53.04% Train). Generation: 30s, Training: 35s. Estimated remaining time: 44h 49m 26s. Estimated total time: 56h 4m 48s. Time estimates for 10 more iterations: 11m 12s, 100 more iterations: 1h 52m 9s, 500 more iterations: 9h 20m 48s. [2025-11-13 09:25:03,524][__main__][INFO] - Starting iteration 693. [2025-11-13 09:25:04,006][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 69 and human policies 1. 
[2025-11-13 09:25:04,007][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:25:40,582][__main__][INFO] - Number of regex retries in iteration 693: 0 [2025-11-13 09:25:40,583][__main__][INFO] - agents played in iteration 693 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:25:41,441][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:41,468][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:41,494][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:41,517][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:25:41,518][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:25:41,518][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:25:42,326][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:25:42,787][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:25:43,290][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:25:43,797][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:25:44,304][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:25:44,809][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:25:45,319][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:25:45,815][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:25:46,318][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:25:46,818][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:25:47,335][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:25:47,838][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:25:48,333][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:25:48,841][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:25:49,342][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:25:49,849][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:25:50,347][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:25:50,838][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:25:51,341][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:25:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:25:52,350][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:25:52,853][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:25:53,352][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:25:53,862][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:25:54,364][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:25:54,872][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:25:55,380][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:25:55,887][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:25:56,394][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:25:56,903][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:25:57,401][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:25:57,908][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:25:58,412][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:25:58,917][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:25:59,430][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:25:59,928][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:26:00,453][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:26:00,959][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:26:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:26:01,966][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:26:02,467][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:26:02,974][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:26:03,480][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:26:03,985][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:26:04,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:26:05,007][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:26:05,515][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:26:06,020][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:26:06,526][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:26:07,038][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:26:07,535][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:26:08,041][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:26:08,547][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:26:09,044][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:26:09,552][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:26:10,048][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:26:10,551][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:26:11,055][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:26:11,557][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:26:12,059][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:26:12,563][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:26:13,063][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:26:13,563][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:26:14,068][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:26:14,567][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10006 tokens. [2025-11-13 09:26:15,418][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.09%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 09:26:16,200][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:26:16,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:26:16,203][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:26:17,143][__main__][INFO] - Iteration 694 took 1m 13s (50.01% Gen, 48.70% Train). Generation: 36s, Training: 35s. Estimated remaining time: 49h 40m 18s. Estimated total time: 60h 56m 54s. Time estimates for 10 more iterations: 12m 11s, 100 more iterations: 2h 1m 53s, 500 more iterations: 10h 9m 29s. [2025-11-13 09:26:17,145][__main__][INFO] - Starting iteration 694. [2025-11-13 09:26:17,618][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 69 and human policies 1. 
[2025-11-13 09:26:17,618][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:26:42,565][__main__][INFO] - Number of regex retries in iteration 694: 0 [2025-11-13 09:26:42,565][__main__][INFO] - agents played in iteration 694 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:26:43,391][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:43,419][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:43,446][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:43,469][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:26:43,469][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:26:43,470][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:26:44,298][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:26:44,751][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:26:45,262][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:26:45,767][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:26:46,279][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:26:46,781][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:26:47,277][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:26:47,781][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:26:48,283][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:26:48,783][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:26:49,283][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:26:49,783][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:26:50,285][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:26:50,785][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:26:51,284][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:26:51,784][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:26:52,283][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:26:52,781][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:26:53,275][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:26:53,768][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:26:54,280][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:26:54,790][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:26:55,288][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:26:55,785][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:26:56,285][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:26:56,800][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:26:57,310][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:26:57,821][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:26:58,328][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:26:58,833][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:26:59,352][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:26:59,859][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:27:00,363][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:27:00,900][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:27:01,408][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:27:01,924][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:27:02,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:27:02,944][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:27:03,460][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:27:03,970][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:27:04,481][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:27:04,987][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:27:05,492][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:27:05,997][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:27:06,500][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:27:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:27:07,509][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:27:08,018][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:27:08,523][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:27:09,031][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:27:09,536][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:27:10,040][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:27:10,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:27:11,060][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:27:11,562][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:27:12,064][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:27:12,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:27:13,063][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:27:13,572][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:27:14,071][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:27:14,572][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:27:15,079][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:27:15,583][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:27:16,082][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:27:16,584][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10030 tokens. [2025-11-13 09:27:17,428][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.35%, ΔTime: 00:00:33 [2025-11-13 09:27:18,231][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:27:18,233][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:27:18,234][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:27:19,186][__main__][INFO] - Iteration 695 took 1m 1s (40.52% Gen, 57.93% Train). Generation: 24s, Training: 35s. Estimated remaining time: 40h 0m 51s. Estimated total time: 51h 18m 28s. Time estimates for 10 more iterations: 10m 15s, 100 more iterations: 1h 42m 36s, 500 more iterations: 8h 33m 4s. [2025-11-13 09:27:19,188][__main__][INFO] - Starting iteration 695. [2025-11-13 09:27:19,685][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 69 and human policies 1. 
[2025-11-13 09:27:19,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:27:39,952][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:27:49,690][__main__][INFO] - Number of regex retries in iteration 695: 1 [2025-11-13 09:27:49,690][__main__][INFO] - agents played in iteration 695 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:27:50,566][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:50,591][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:50,628][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:50,650][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:27:50,651][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:27:50,652][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:27:51,462][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:27:51,918][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:27:52,423][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:27:52,928][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:27:53,430][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:27:53,930][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:27:54,436][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:27:54,935][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:27:55,443][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:27:55,948][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:27:56,445][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:27:56,948][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:27:57,452][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:27:57,985][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:27:58,492][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:27:58,996][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:27:59,502][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:28:00,012][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:28:00,519][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:28:01,023][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:28:01,526][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:28:02,028][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:28:02,529][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:28:03,034][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:28:03,540][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:28:04,043][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:28:04,551][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:28:05,053][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:28:05,571][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:28:06,077][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:28:06,597][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:28:07,102][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:28:07,610][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:28:08,117][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:28:08,623][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:28:09,130][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:28:09,642][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:28:10,148][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:28:10,655][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:28:11,163][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:28:11,667][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:28:12,180][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:28:12,684][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:28:13,195][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:28:13,696][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:28:14,198][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:28:14,706][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:28:15,208][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:28:15,711][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:28:16,217][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:28:16,724][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:28:17,236][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:28:17,767][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:28:18,276][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:28:18,788][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:28:19,295][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:28:19,810][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:28:20,318][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:28:20,817][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:28:21,323][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:28:21,830][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:28:22,335][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:28:22,839][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:28:23,344][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:28:23,856][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10019 tokens. [2025-11-13 09:28:24,804][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.28%, Current % of VRAM taken: 58.53%, Block Peak % of device VRAM: 62.35%, ΔTime: 00:00:33 [2025-11-13 09:28:25,461][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:28:25,462][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:28:25,464][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:28:26,316][__main__][INFO] - Iteration 696 took 1m 6s (45.03% Gen, 53.69% Train). Generation: 30s, Training: 35s. Estimated remaining time: 44h 12m 50s. Estimated total time: 55h 31m 34s. Time estimates for 10 more iterations: 11m 6s, 100 more iterations: 1h 51m 3s, 500 more iterations: 9h 15m 15s. [2025-11-13 09:28:26,319][__main__][INFO] - Starting iteration 696. [2025-11-13 09:28:26,825][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 69 and human policies 1. 
[2025-11-13 09:28:26,826][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:28:53,650][__main__][INFO] - Number of regex retries in iteration 696: 0 [2025-11-13 09:28:53,651][__main__][INFO] - agents played in iteration 696 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:28:54,486][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:54,511][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:54,537][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:54,560][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.40%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:28:54,560][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:28:54,561][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:28:55,377][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:28:55,979][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:28:56,488][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:28:56,984][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:28:57,485][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:28:58,008][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:28:58,509][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:28:59,012][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:28:59,516][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:29:00,022][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:29:00,534][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:29:01,038][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:29:01,541][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:29:02,047][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:29:02,554][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:29:03,056][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:29:03,572][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:29:04,080][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:29:04,600][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:29:05,106][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:29:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:29:06,129][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:29:06,634][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:29:07,141][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:29:07,652][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:29:08,164][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:29:08,675][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:29:09,175][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:29:09,680][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:29:10,188][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:29:10,692][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:29:11,203][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:29:11,709][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:29:12,238][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:29:12,750][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:29:13,255][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:29:13,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:29:14,268][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:29:14,772][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:29:15,294][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:29:15,797][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:29:16,301][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:29:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:29:17,305][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:29:17,817][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:29:18,319][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:29:18,819][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:29:19,323][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:29:19,827][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:29:20,336][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:29:20,843][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:29:21,348][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:29:21,859][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:29:22,368][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:29:22,883][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:29:23,387][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:29:23,889][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:29:24,404][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:29:24,905][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:29:25,405][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:29:25,906][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:29:26,404][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:29:26,920][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:29:27,419][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:29:27,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9873 tokens. [2025-11-13 09:29:28,821][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.09%, ΔTime: 00:00:33 [2025-11-13 09:29:29,543][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:29:29,544][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:29:29,546][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:29:30,466][__main__][INFO] - Iteration 697 took 1m 3s (42.15% Gen, 56.40% Train). Generation: 26s, Training: 35s. Estimated remaining time: 41h 42m 13s. Estimated total time: 53h 2m 2s. Time estimates for 10 more iterations: 10m 36s, 100 more iterations: 1h 46m 4s, 500 more iterations: 8h 50m 20s. [2025-11-13 09:29:30,468][__main__][INFO] - Starting iteration 697. [2025-11-13 09:29:30,948][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 69 and human policies 1. 
[2025-11-13 09:29:30,949][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:29:50,899][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:30:03,467][__main__][INFO] - Number of regex retries in iteration 697: 1 [2025-11-13 09:30:03,468][__main__][INFO] - agents played in iteration 697 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:30:04,311][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:04,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:04,361][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:04,383][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:30:04,383][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:30:04,385][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:30:05,186][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:30:05,644][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:30:06,151][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:30:06,653][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:30:07,154][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:30:07,657][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:30:08,159][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:30:08,661][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:30:09,166][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:30:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:30:10,188][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:30:10,690][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:30:11,211][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:30:11,718][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:30:12,224][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:30:12,740][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:30:13,248][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:30:13,751][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:30:14,256][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:30:14,763][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:30:15,268][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:30:15,775][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:30:16,281][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:30:16,803][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:30:17,310][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:30:17,820][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:30:18,326][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:30:18,835][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:30:19,345][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:30:19,850][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:30:20,360][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:30:20,868][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:30:21,375][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:30:21,882][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:30:22,386][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:30:22,894][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:30:23,399][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:30:23,904][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:30:24,432][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:30:24,940][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:30:25,447][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:30:25,952][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:30:26,450][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:30:26,961][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:30:27,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:30:27,972][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:30:28,475][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:30:28,980][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:30:29,481][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:30:29,985][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:30:30,490][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:30:31,019][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:30:31,529][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:30:32,035][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:30:32,546][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:30:33,046][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:30:33,553][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:30:34,055][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:30:34,562][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:30:35,068][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:30:35,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:30:36,081][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:30:36,587][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:30:37,095][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:30:37,617][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10130 tokens. [2025-11-13 09:30:38,554][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.28%, ΔTime: 00:00:33 [2025-11-13 09:30:39,198][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:30:39,200][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:30:39,201][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:30:40,043][__main__][INFO] - Iteration 698 took 1m 9s (47.06% Gen, 51.72% Train). Generation: 32s, Training: 35s. Estimated remaining time: 46h 13m 50s. Estimated total time: 57h 34m 49s. Time estimates for 10 more iterations: 11m 30s, 100 more iterations: 1h 55m 9s, 500 more iterations: 9h 35m 48s. [2025-11-13 09:30:40,045][__main__][INFO] - Starting iteration 698. [2025-11-13 09:30:40,534][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 69 and human policies 1. 
[2025-11-13 09:30:40,534][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:31:13,772][__main__][INFO] - Number of regex retries in iteration 698: 0 [2025-11-13 09:31:13,773][__main__][INFO] - agents played in iteration 698 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:31:14,621][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:14,647][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:14,671][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:14,694][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:31:14,694][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:31:14,695][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:31:15,553][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:31:16,017][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:31:16,522][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:31:17,037][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:31:17,543][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:31:18,039][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:31:18,543][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:31:19,041][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:31:19,548][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:31:20,054][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:31:20,558][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:31:21,071][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:31:21,577][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:31:22,081][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:31:22,586][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:31:23,088][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:31:23,592][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:31:24,090][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:31:24,589][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:31:25,091][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:31:25,603][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:31:26,122][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:31:26,631][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:31:27,137][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:31:27,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:31:28,150][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:31:28,656][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:31:29,163][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:31:29,666][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:31:30,169][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:31:30,675][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:31:31,177][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:31:31,679][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:31:32,184][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:31:32,690][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:31:33,195][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:31:33,697][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:31:34,218][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:31:34,743][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:31:35,254][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:31:35,757][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:31:36,259][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:31:36,767][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:31:37,271][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:31:37,780][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:31:38,283][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:31:38,792][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:31:39,300][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:31:39,805][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:31:40,305][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:31:40,822][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:31:41,324][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:31:41,839][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:31:42,337][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:31:42,836][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:31:43,351][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:31:43,857][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:31:44,364][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:31:44,873][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:31:45,381][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:31:45,887][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:31:46,395][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:31:46,902][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:31:47,411][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:31:47,921][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10032 tokens. [2025-11-13 09:31:48,841][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.27%, Current % of VRAM taken: 58.52%, Block Peak % of device VRAM: 62.30%, ΔTime: 00:00:33 [2025-11-13 09:31:49,593][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:31:49,597][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:31:49,599][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:31:50,509][__main__][INFO] - Iteration 699 took 1m 9s (47.50% Gen, 51.20% Train). Generation: 33s, Training: 35s. Estimated remaining time: 46h 56m 38s. Estimated total time: 58h 18m 47s. Time estimates for 10 more iterations: 11m 39s, 100 more iterations: 1h 56m 37s, 500 more iterations: 9h 43m 7s. [2025-11-13 09:31:50,511][__main__][INFO] - Starting iteration 699. [2025-11-13 09:31:50,993][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 69 and human policies 1. 
[2025-11-13 09:31:50,994][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:32:16,635][__main__][INFO] - Number of regex retries in iteration 699: 0 [2025-11-13 09:32:16,636][__main__][INFO] - agents played in iteration 699 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:32:17,526][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:17,553][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:17,580][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:17,603][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.39%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:32:17,604][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:32:17,605][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:32:18,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:32:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:32:19,476][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:32:19,983][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:32:20,486][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:32:20,987][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:32:21,489][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:32:21,995][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:32:22,506][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:32:23,013][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:32:23,517][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:32:24,026][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:32:24,530][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:32:25,037][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:32:25,546][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:32:26,057][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:32:26,575][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:32:27,078][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:32:27,582][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:32:28,087][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:32:28,590][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:32:29,095][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:32:29,600][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:32:30,104][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:32:30,609][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:32:31,113][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:32:31,616][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:32:32,117][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:32:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:32:33,125][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:32:33,627][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:32:34,129][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:32:34,631][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:32:35,134][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:32:35,643][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:32:36,145][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:32:36,646][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:32:37,163][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:32:37,666][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:32:38,166][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:32:38,668][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:32:39,172][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:32:39,676][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:32:40,177][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:32:40,681][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:32:41,181][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:32:41,689][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:32:42,188][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:32:42,699][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:32:43,466][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:32:44,118][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:32:44,622][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:32:45,130][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:32:45,636][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:32:46,164][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:32:46,665][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:32:47,169][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:32:47,680][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:32:48,186][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:32:48,696][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:32:49,208][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:32:49,714][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:32:50,223][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:32:50,730][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:32:51,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10057 tokens. [2025-11-13 09:32:52,165][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.15%, ΔTime: 00:00:33 [2025-11-13 09:32:52,848][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:32:52,850][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:32:52,852][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:32:53,651][__main__][INFO] - Iteration 700 took 1m 2s (40.92% Gen, 57.80% Train). Generation: 25s, Training: 36s. Estimated remaining time: 40h 49m 41s. Estimated total time: 52h 12m 53s. Time estimates for 10 more iterations: 10m 26s, 100 more iterations: 1h 44m 25s, 500 more iterations: 8h 42m 8s. [2025-11-13 09:32:53,653][__main__][INFO] - Starting iteration 700. [2025-11-13 09:32:54,151][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 69 and human policies 1. 
[2025-11-13 09:32:54,151][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:33:29,879][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls Given our per-item values and Alice's per-item values, allocating all balls or all books to either party seems to maximize the points for both of us. However, since we both value books highly and there's no reason to assume Alice will propose all balls, splitting books and balls might be more strategic. But due to the symmetry and equal expected values, splitting all items could lead to a more stable outcome. Let's split the items: Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:33:30,903][__main__][INFO] - Number of regex retries in iteration 700: 1 [2025-11-13 09:33:30,903][__main__][INFO] - agents played in iteration 700 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:33:31,738][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:31,765][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:31,790][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:33:31,813][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 
09:33:31,813][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:33:31,814][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:33:32,723][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:33:33,183][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:33:33,701][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:33:34,208][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:33:34,716][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:33:35,223][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:33:35,730][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:33:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:33:36,743][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:33:37,244][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:33:37,760][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:33:38,263][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:33:38,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:33:39,283][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:33:39,781][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:33:40,284][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:33:40,786][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:33:41,293][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:33:41,798][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 
09:33:42,303][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:33:42,812][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 09:33:43,309][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:33:43,809][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:33:44,310][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:33:44,815][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:33:45,317][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:33:45,819][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:33:46,322][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:33:46,830][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:33:47,330][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:33:47,833][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:33:48,341][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:33:48,846][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:33:49,369][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:33:49,879][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:33:50,381][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:33:50,884][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:33:51,389][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:33:51,897][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:33:52,403][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 
09:33:52,906][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:33:53,412][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 09:33:53,918][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:33:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:33:54,927][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:33:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:33:55,936][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:33:56,430][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:33:56,933][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:33:57,444][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:33:57,949][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:33:58,454][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:33:58,960][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:33:59,467][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:33:59,970][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:34:00,475][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:34:00,981][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:34:01,486][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:34:01,986][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:34:02,498][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:34:03,004][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 
09:34:03,507][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64
[2025-11-13 09:34:04,017][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64
[2025-11-13 09:34:04,515][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 09:34:05,018][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9974 tokens.
[2025-11-13 09:34:05,911][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.41%, ΔTime: 00:00:33
[2025-11-13 09:34:06,675][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:34:06,676][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:34:06,678][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:34:08,454][__main__][INFO] - Iteration 701 took 1m 14s (49.46% Gen, 48.15% Train). Generation: 36s, Training: 35s. Estimated remaining time: 50h 30m 45s. Estimated total time: 61h 55m 12s. Time estimates for 10 more iterations: 12m 23s, 100 more iterations: 2h 3m 50s, 500 more iterations: 10h 19m 12s.
[2025-11-13 09:34:08,456][__main__][INFO] - Starting iteration 701.
[2025-11-13 09:34:08,944][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 70 and human policies 1.
[2025-11-13 09:34:08,945][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:34:38,080][__main__][INFO] - Number of regex retries in iteration 701: 0
[2025-11-13 09:34:38,081][__main__][INFO] - agents played in iteration 701 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 09:34:38,957][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:38,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:39,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:39,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:34:39,047][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:34:39,048][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:34:39,869][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:34:40,324][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:34:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:34:41,337][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:34:41,844][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:34:42,349][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:34:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:34:43,354][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:34:43,855][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:34:44,359][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:34:44,862][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:34:45,367][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:34:45,872][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:34:46,377][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:34:46,907][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:34:47,410][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:34:47,916][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:34:48,429][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:34:48,937][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:34:49,445][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:34:49,952][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:34:50,452][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:34:50,953][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:34:51,454][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:34:51,954][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:34:52,459][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:34:52,959][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:34:53,459][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:34:53,960][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:34:54,463][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:34:54,964][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:34:55,463][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:34:55,958][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:34:56,458][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:34:56,961][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:34:57,460][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:34:57,959][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:34:58,459][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:34:58,971][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:34:59,486][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:35:00,002][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:35:00,508][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:35:01,019][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:35:01,524][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:35:02,027][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:35:02,533][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:35:03,038][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:35:03,539][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:35:04,045][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:35:04,551][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:35:05,049][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:35:05,547][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:35:06,052][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:35:06,556][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:35:07,062][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:35:07,566][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:35:08,075][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:35:08,583][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:35:09,088][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:35:09,604][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:35:10,111][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:35:10,649][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:35:11,913][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:35:12,419][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 09:35:12,933][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10101 tokens.
[2025-11-13 09:35:13,866][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.18%, Current % of VRAM taken: 58.43%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:34
[2025-11-13 09:35:14,553][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:35:14,555][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:35:14,556][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:35:15,421][__main__][INFO] - Iteration 702 took 1m 6s (43.83% Gen, 54.87% Train). Generation: 29s, Training: 36s. Estimated remaining time: 43h 58m 20s. Estimated total time: 55h 23m 54s. Time estimates for 10 more iterations: 11m 4s, 100 more iterations: 1h 50m 47s, 500 more iterations: 9h 13m 59s.
[2025-11-13 09:35:15,425][__main__][INFO] - Starting iteration 702.
[2025-11-13 09:35:15,911][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 70 and human policies 1.
[2025-11-13 09:35:15,911][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:35:47,131][__main__][INFO] - Number of regex retries in iteration 702: 0
[2025-11-13 09:35:47,133][__main__][INFO] - agents played in iteration 702 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 09:35:47,966][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:47,995][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:48,022][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:48,045][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:35:48,045][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:35:48,046][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:35:48,876][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:35:49,335][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:35:49,843][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:35:50,349][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:35:50,852][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:35:51,361][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:35:51,867][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:35:52,375][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:35:52,880][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:35:53,381][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:35:53,883][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:35:54,384][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:35:54,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:35:55,401][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:35:55,901][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:35:56,411][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:35:56,910][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:35:57,418][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:35:57,950][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:35:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:35:58,972][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:35:59,474][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:35:59,976][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:36:00,483][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:36:00,983][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:36:01,485][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:36:01,994][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:36:02,500][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:36:03,026][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:36:03,536][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:36:04,046][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:36:04,557][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:36:05,063][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:36:05,569][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:36:06,081][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:36:06,586][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:36:07,091][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:36:07,597][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:36:08,103][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:36:08,615][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:36:09,124][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:36:09,636][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:36:10,142][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:36:10,645][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:36:11,152][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:36:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:36:12,161][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:36:12,667][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:36:13,170][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:36:13,672][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:36:14,174][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:36:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:36:15,184][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:36:15,688][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:36:16,190][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:36:16,697][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:36:17,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:36:17,700][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:36:18,201][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:36:18,701][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:36:19,200][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:36:19,700][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:36:20,204][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:36:20,703][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 09:36:21,206][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10108 tokens.
[2025-11-13 09:36:22,053][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33
[2025-11-13 09:36:22,839][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:36:22,840][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:36:22,842][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:36:23,923][__main__][INFO] - Iteration 703 took 1m 8s (45.91% Gen, 52.50% Train). Generation: 31s, Training: 35s. Estimated remaining time: 45h 13m 55s. Estimated total time: 56h 40m 37s. Time estimates for 10 more iterations: 11m 20s, 100 more iterations: 1h 53m 21s, 500 more iterations: 9h 26m 46s.
[2025-11-13 09:36:23,925][__main__][INFO] - Starting iteration 703.
[2025-11-13 09:36:24,400][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 70 and human policies 1.
[2025-11-13 09:36:24,401][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:36:52,950][__main__][INFO] - Number of regex retries in iteration 703: 0
[2025-11-13 09:36:52,950][__main__][INFO] - agents played in iteration 703 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 09:36:53,767][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:53,795][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:53,820][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:53,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:36:53,844][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:36:53,845][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:36:54,670][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:36:55,127][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:36:55,634][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:36:56,139][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:36:56,646][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:36:57,147][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:36:57,643][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:36:58,144][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:36:58,646][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:36:59,146][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:36:59,656][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:37:00,160][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:37:00,662][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:37:01,170][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:37:01,671][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:37:02,173][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:37:02,677][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:37:03,182][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:37:03,683][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:37:04,185][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:37:04,685][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:37:05,186][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:37:05,688][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:37:06,199][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:37:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:37:07,207][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:37:07,711][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:37:08,215][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:37:08,721][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:37:09,226][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:37:09,729][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:37:10,250][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:37:10,749][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:37:11,256][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:37:11,759][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:37:12,262][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:37:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:37:13,271][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:37:13,780][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:37:14,287][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:37:14,796][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:37:15,305][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:37:15,810][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:37:16,318][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:37:16,823][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:37:17,328][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:37:17,832][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:37:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:37:18,838][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:37:19,356][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:37:19,859][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:37:20,363][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:37:20,862][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:37:21,366][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:37:21,860][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:37:22,355][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:37:22,856][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:37:23,368][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:37:23,870][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:37:24,372][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:37:24,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:37:25,382][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:37:25,888][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:37:26,387][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64
[2025-11-13 09:37:26,889][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10042 tokens.
[2025-11-13 09:37:27,732][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.12%, ΔTime: 00:00:33
[2025-11-13 09:37:28,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt
[2025-11-13 09:37:28,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt
[2025-11-13 09:37:28,513][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl
[2025-11-13 09:37:29,464][__main__][INFO] - Iteration 704 took 1m 5s (43.88% Gen, 54.66% Train). Generation: 28s, Training: 35s. Estimated remaining time: 42h 45m 25s. Estimated total time: 54h 13m 13s. Time estimates for 10 more iterations: 10m 50s, 100 more iterations: 1h 48m 26s, 500 more iterations: 9h 2m 12s.
[2025-11-13 09:37:29,468][__main__][INFO] - Starting iteration 704.
[2025-11-13 09:37:29,957][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 70 and human policies 1.
[2025-11-13 09:37:29,958][__main__][INFO] - Hard coded buffer agents are set to False with prob 0
[2025-11-13 09:38:03,568][__main__][INFO] - Number of regex retries in iteration 704: 0
[2025-11-13 09:38:03,570][__main__][INFO] - agents played in iteration 704 are Alice, Bob_buffer, Alice_buffer, Bob
[2025-11-13 09:38:04,456][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:04,491][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:04,519][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:04,545][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00
[2025-11-13 09:38:04,545][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data.
[2025-11-13 09:38:04,546][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets.
[2025-11-13 09:38:05,462][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:38:05,916][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:38:06,437][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:38:06,936][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:38:07,435][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:38:07,950][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:38:08,460][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:38:08,959][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:38:09,468][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:38:09,976][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:38:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:38:10,989][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:38:11,495][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:38:12,001][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:38:12,508][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:38:13,029][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:38:13,534][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:38:14,042][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:38:14,547][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:38:15,056][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:38:15,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:38:16,067][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:38:16,578][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:38:17,087][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:38:17,592][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:38:18,089][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:38:18,603][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:38:19,111][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:38:19,618][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:38:20,123][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:38:20,630][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:38:21,132][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:38:21,638][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:38:22,148][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:38:22,664][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:38:23,170][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:38:23,678][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:38:24,183][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:38:24,691][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:38:25,197][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:38:25,706][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:38:26,212][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:38:26,722][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:38:27,226][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:38:27,724][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:38:28,223][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:38:28,722][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:38:29,222][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:38:29,724][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:38:30,228][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:38:30,726][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:38:31,231][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:38:31,731][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:38:32,230][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:38:32,730][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:38:33,230][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:38:33,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:38:34,230][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:38:34,727][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:38:35,231][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:38:35,736][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:38:36,235][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:38:36,735][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:38:37,233][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:38:37,732][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10004 tokens. [2025-11-13 09:38:38,590][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 12.77%, Current % of VRAM taken: 57.02%, Block Peak % of device VRAM: 62.23%, ΔTime: 00:00:33 [2025-11-13 09:38:39,218][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:38:39,219][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:38:39,221][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:38:40,229][__main__][INFO] - Iteration 705 took 1m 10s (47.83% Gen, 50.73% Train). Generation: 33s, Training: 35s. Estimated remaining time: 47h 4m 38s. Estimated total time: 58h 33m 37s. Time estimates for 10 more iterations: 11m 42s, 100 more iterations: 1h 57m 7s, 500 more iterations: 9h 45m 36s. [2025-11-13 09:38:40,232][__main__][INFO] - Starting iteration 705. [2025-11-13 09:38:40,720][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 70 and human policies 1. 
[2025-11-13 09:38:40,721][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:39:06,420][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 0 balls Given that the item quantities are 10 hats, 10 books, and 10 balls, and considering my per-item values for hats (10), books (1), and balls (1) are significantly higher than Bob's values, I propose taking all the hats. This strategy maximizes the potential points for this round by leveraging the high value I place on hats. Since Bob values hats at only 1, and I value them at 10, the proportional allocation in case of an overrun would still favor me by a large margin. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:39:12,263][__main__][INFO] - Number of regex retries in iteration 705: 1 [2025-11-13 09:39:12,264][__main__][INFO] - agents played in iteration 705 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:39:13,112][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.15%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:13,140][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.15%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:13,166][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.15%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:39:13,189][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 50.15%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 
09:39:13,189][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:39:13,191][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. [2025-11-13 09:39:14,053][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:39:14,516][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:39:15,026][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:39:15,539][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:39:16,043][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:39:16,550][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:39:17,054][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:39:17,566][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:39:18,086][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:39:18,593][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:39:19,100][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:39:19,607][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:39:20,111][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:39:20,623][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:39:21,124][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:39:21,629][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:39:22,134][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:39:22,639][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:39:23,148][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 
09:39:23,656][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:39:24,160][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 09:39:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:39:25,171][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:39:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:39:26,192][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:39:26,697][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:39:27,205][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:39:27,717][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:39:28,225][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:39:28,730][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:39:29,237][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:39:29,745][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:39:30,253][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:39:30,749][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:39:31,245][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:39:31,741][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:39:32,242][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:39:32,743][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:39:33,246][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:39:33,741][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 
09:39:34,234][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:39:34,733][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 09:39:35,233][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:39:35,734][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:39:36,236][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:39:36,737][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:39:37,238][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:39:37,742][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:39:38,238][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:39:38,739][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:39:39,243][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:39:39,747][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:39:40,249][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:39:40,750][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:39:41,253][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:39:41,753][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:39:42,255][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:39:42,760][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:39:43,267][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:39:43,775][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:39:44,280][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 
09:39:44,788][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:39:45,308][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 09:39:45,809][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:39:46,311][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10133 tokens. [2025-11-13 09:39:47,136][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.24%, Current % of VRAM taken: 58.48%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 09:39:47,860][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:39:47,862][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:39:47,864][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:39:48,811][__main__][INFO] - Iteration 706 took 1m 8s (46.32% Gen, 52.28% Train). Generation: 31s, Training: 35s. Estimated remaining time: 45h 14m 25s. Estimated total time: 56h 44m 32s. Time estimates for 10 more iterations: 11m 20s, 100 more iterations: 1h 53m 29s, 500 more iterations: 9h 27m 25s. [2025-11-13 09:39:48,813][__main__][INFO] - Starting iteration 706. [2025-11-13 09:39:49,281][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 70 and human policies 1. 
[2025-11-13 09:39:49,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:40:15,446][__main__][INFO] - Number of regex retries in iteration 706: 0 [2025-11-13 09:40:15,448][__main__][INFO] - agents played in iteration 706 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:40:16,362][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:16,404][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:16,439][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:16,467][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:40:16,468][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:40:16,469][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:40:17,421][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:40:17,885][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:40:18,397][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:40:18,907][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:40:19,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:40:19,921][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:40:20,429][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:40:20,938][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:40:21,445][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:40:21,953][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:40:22,464][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:40:22,970][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:40:23,481][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:40:23,988][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:40:24,499][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:40:25,006][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:40:25,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:40:26,025][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:40:26,529][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:40:27,034][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:40:27,541][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:40:28,045][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:40:28,551][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:40:29,057][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:40:29,556][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:40:30,064][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:40:30,570][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:40:31,078][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:40:31,586][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:40:32,091][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:40:32,594][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:40:33,102][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:40:33,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:40:34,104][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:40:34,605][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:40:35,123][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:40:35,622][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:40:36,122][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:40:36,622][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:40:37,123][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:40:37,625][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:40:38,129][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:40:38,636][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:40:39,143][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:40:39,648][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:40:40,149][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:40:40,654][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:40:41,156][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:40:41,656][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:40:42,160][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:40:42,662][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:40:43,167][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:40:43,669][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:40:44,170][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:40:44,672][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:40:45,175][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:40:45,678][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:40:46,178][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:40:46,680][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:40:47,182][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:40:47,686][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:40:48,200][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:40:48,702][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:40:49,206][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:40:49,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9975 tokens. [2025-11-13 09:40:50,568][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 09:40:51,218][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:40:51,221][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:40:51,222][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:40:52,167][__main__][INFO] - Iteration 707 took 1m 2s (41.61% Gen, 56.89% Train). Generation: 26s, Training: 35s. Estimated remaining time: 40h 53m 8s. Estimated total time: 52h 24m 18s. Time estimates for 10 more iterations: 10m 28s, 100 more iterations: 1h 44m 48s, 500 more iterations: 8h 44m 3s. [2025-11-13 09:40:52,169][__main__][INFO] - Starting iteration 707. [2025-11-13 09:40:52,663][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 70 and human policies 1. 
[2025-11-13 09:40:52,663][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:41:23,575][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:41:24,464][__main__][INFO] - Number of regex retries in iteration 707: 1 [2025-11-13 09:41:24,465][__main__][INFO] - agents played in iteration 707 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:41:25,331][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:25,354][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:25,377][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:25,399][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:41:25,400][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:41:25,401][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:41:26,295][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:41:26,757][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:41:27,270][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:41:27,766][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:41:28,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:41:28,773][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:41:29,272][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:41:29,783][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:41:30,294][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:41:30,798][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:41:31,303][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:41:31,804][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:41:32,308][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:41:32,812][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:41:33,314][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:41:33,812][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:41:34,330][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:41:34,840][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:41:35,347][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:41:35,853][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:41:36,359][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:41:36,870][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:41:37,377][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:41:37,883][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:41:38,393][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:41:38,893][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:41:39,398][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:41:39,907][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:41:40,412][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:41:40,926][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:41:41,427][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:41:41,926][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:41:42,430][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:41:42,933][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:41:43,439][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:41:43,945][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:41:44,445][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:41:44,949][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:41:45,450][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:41:45,950][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:41:46,455][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:41:46,959][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:41:47,464][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:41:47,964][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:41:48,465][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:41:48,965][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:41:49,466][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:41:49,971][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:41:50,475][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:41:50,981][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:41:51,476][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:41:51,980][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:41:52,480][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:41:52,988][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:41:53,493][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:41:53,993][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:41:54,504][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:41:55,007][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:41:55,528][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:41:56,029][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:41:56,522][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:41:57,021][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:41:57,516][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:41:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:41:58,513][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9992 tokens. [2025-11-13 09:41:59,336][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.01%, Current % of VRAM taken: 58.25%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:33 [2025-11-13 09:42:00,063][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:42:00,065][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:42:00,067][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:42:01,066][__main__][INFO] - Iteration 708 took 1m 8s (46.49% Gen, 52.05% Train). Generation: 31s, Training: 35s. Estimated remaining time: 45h 27m 51s. Estimated total time: 57h 0m 10s. Time estimates for 10 more iterations: 11m 24s, 100 more iterations: 1h 54m 0s, 500 more iterations: 9h 30m 1s. [2025-11-13 09:42:01,068][__main__][INFO] - Starting iteration 708. [2025-11-13 09:42:01,541][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 70 and human policies 1. 
[2025-11-13 09:42:01,542][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:42:34,516][__main__][INFO] - Number of regex retries in iteration 708: 0 [2025-11-13 09:42:34,519][__main__][INFO] - agents played in iteration 708 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:42:35,412][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:35,444][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:35,471][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:35,496][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.25%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:42:35,496][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:42:35,498][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:42:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:42:36,873][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:42:37,378][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:42:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:42:38,397][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:42:38,924][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:42:39,423][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:42:39,936][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:42:40,442][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:42:40,939][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:42:41,449][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:42:41,951][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:42:42,450][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:42:42,952][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:42:43,454][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:42:43,961][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:42:44,468][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:42:44,972][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:42:45,478][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:42:45,986][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:42:46,489][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:42:46,994][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:42:47,498][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:42:48,002][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:42:48,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:42:49,009][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:42:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:42:50,021][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:42:50,536][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:42:51,042][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:42:51,542][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:42:52,051][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:42:52,550][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:42:53,054][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:42:53,559][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:42:54,059][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:42:54,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:42:55,063][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:42:55,564][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:42:56,066][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:42:56,567][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:42:57,063][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:42:57,564][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:42:58,068][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:42:58,574][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:42:59,076][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:42:59,576][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:43:00,078][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:43:00,579][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:43:01,074][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:43:01,577][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:43:02,072][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:43:02,573][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:43:03,076][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:43:03,578][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:43:04,076][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:43:04,580][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:43:05,084][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:43:05,586][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:43:06,086][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:43:06,590][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:43:07,093][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:43:07,597][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:43:08,102][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:43:08,598][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9865 tokens. [2025-11-13 09:43:09,531][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.81%, Current % of VRAM taken: 58.06%, Block Peak % of device VRAM: 62.14%, ΔTime: 00:00:33 [2025-11-13 09:43:10,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:43:10,260][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:43:10,263][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:43:11,244][__main__][INFO] - Iteration 709 took 1m 9s (47.31% Gen, 51.28% Train). Generation: 32s, Training: 35s. Estimated remaining time: 46h 31m 40s. Estimated total time: 58h 5m 10s. Time estimates for 10 more iterations: 11m 37s, 100 more iterations: 1h 56m 10s, 500 more iterations: 9h 40m 51s. [2025-11-13 09:43:11,247][__main__][INFO] - Starting iteration 709. [2025-11-13 09:43:11,738][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 70 and human policies 1. 
[2025-11-13 09:43:11,740][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:43:45,951][__main__][INFO] - Number of regex retries in iteration 709: 0 [2025-11-13 09:43:45,952][__main__][INFO] - agents played in iteration 709 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:43:46,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.22%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:46,801][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.22%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:46,824][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.22%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:46,847][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.22%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:43:46,847][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:43:46,848][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:43:47,764][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:43:48,228][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:43:48,733][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:43:49,241][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:43:49,747][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:43:50,251][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:43:50,750][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:43:51,264][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:43:51,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:43:52,278][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:43:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:43:53,287][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:43:53,789][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:43:54,285][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:43:54,788][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:43:55,293][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:43:55,796][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:43:56,296][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:43:56,802][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:43:57,307][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:43:57,808][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:43:58,310][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:43:58,811][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:43:59,339][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:43:59,845][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:44:00,349][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:44:00,864][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:44:01,366][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:44:01,871][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:44:02,370][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:44:02,873][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:44:03,375][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:44:03,879][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:44:04,388][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:44:04,898][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:44:05,404][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:44:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:44:06,411][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:44:06,916][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:44:07,421][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:44:07,932][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:44:08,435][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:44:08,943][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:44:09,444][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:44:09,952][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:44:10,452][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:44:10,969][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:44:11,472][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:44:11,972][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:44:12,481][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:44:12,988][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:44:13,501][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:44:14,008][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:44:14,514][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:44:15,023][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:44:15,525][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:44:16,030][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:44:16,537][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:44:17,038][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:44:17,557][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:44:18,061][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:44:18,566][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:44:19,075][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:44:19,580][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:44:20,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10003 tokens. [2025-11-13 09:44:21,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.26%, ΔTime: 00:00:33 [2025-11-13 09:44:21,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:44:21,813][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:44:21,815][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:44:22,758][__main__][INFO] - Iteration 710 took 1m 11s (48.17% Gen, 50.50% Train). Generation: 34s, Training: 35s. Estimated remaining time: 47h 36m 19s. Estimated total time: 59h 11m 1s. Time estimates for 10 more iterations: 11m 50s, 100 more iterations: 1h 58m 22s, 500 more iterations: 9h 51m 50s. [2025-11-13 09:44:22,760][__main__][INFO] - Starting iteration 710. [2025-11-13 09:44:23,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 70 and human policies 1. 
[2025-11-13 09:44:23,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:44:38,396][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls + 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:44:45,984][__main__][INFO] - Number of regex retries in iteration 710: 1 [2025-11-13 09:44:45,985][__main__][INFO] - agents played in iteration 710 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:44:46,821][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:46,844][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:46,866][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:46,889][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.33%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:44:46,889][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:44:46,890][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:44:47,764][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:44:48,224][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:44:48,738][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:44:49,248][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:44:49,755][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:44:50,268][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:44:50,774][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:44:51,278][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:44:51,784][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:44:52,291][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:44:52,790][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:44:53,291][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:44:53,796][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:44:54,295][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:44:54,798][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:44:55,303][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:44:55,803][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:44:56,305][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:44:56,805][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:44:57,312][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:44:57,811][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:44:58,316][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:44:58,823][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:44:59,325][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:44:59,833][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:45:00,340][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:45:00,856][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:45:01,362][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:45:01,865][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:45:02,366][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:45:02,876][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:45:03,375][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:45:03,876][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:45:04,377][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:45:04,885][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:45:05,386][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:45:05,910][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:45:06,410][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:45:06,908][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:45:07,409][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:45:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:45:08,420][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:45:08,930][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:45:09,433][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:45:09,940][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:45:10,443][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:45:10,943][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:45:11,461][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:45:11,960][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:45:12,465][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:45:12,971][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:45:13,476][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:45:13,987][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:45:14,492][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:45:14,996][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:45:15,502][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:45:16,007][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:45:16,511][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:45:17,014][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:45:17,517][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:45:18,024][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:45:18,523][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:45:19,025][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:45:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:45:20,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9817 tokens. [2025-11-13 09:45:20,965][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.07%, Current % of VRAM taken: 58.31%, Block Peak % of device VRAM: 62.34%, ΔTime: 00:00:33 [2025-11-13 09:45:21,737][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:45:21,738][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:45:21,740][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:45:23,588][__main__][INFO] - Iteration 711 took 1m 0s (37.69% Gen, 59.24% Train). Generation: 22s, Training: 35s. Estimated remaining time: 38h 42m 6s. Estimated total time: 50h 17m 48s. Time estimates for 10 more iterations: 10m 3s, 100 more iterations: 1h 40m 35s, 500 more iterations: 8h 22m 58s. [2025-11-13 09:45:23,590][__main__][INFO] - Starting iteration 711. [2025-11-13 09:45:24,070][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 71 and human policies 1. 
[2025-11-13 09:45:24,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:45:46,575][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:45:58,345][__main__][INFO] - Number of regex retries in iteration 711: 1 [2025-11-13 09:45:58,346][__main__][INFO] - agents played in iteration 711 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:45:59,192][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:59,215][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:59,238][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:59,260][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:45:59,261][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:45:59,262][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:46:00,186][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:46:00,650][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:46:01,157][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:46:01,664][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:46:02,167][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:46:02,671][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:46:03,173][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:46:03,676][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:46:04,192][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:46:04,694][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:46:05,199][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:46:05,707][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:46:06,212][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:46:06,713][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:46:07,211][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:46:07,712][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:46:08,225][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:46:08,732][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:46:09,246][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:46:09,752][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:46:10,260][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:46:10,779][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:46:11,286][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:46:11,791][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:46:12,303][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:46:12,807][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:46:13,313][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:46:13,817][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:46:14,324][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:46:14,830][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:46:15,331][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:46:15,834][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:46:16,333][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:46:16,839][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:46:17,356][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:46:17,860][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:46:18,368][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:46:18,871][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:46:19,374][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:46:19,885][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:46:20,391][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:46:20,889][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:46:21,391][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:46:21,897][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:46:22,405][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:46:22,911][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:46:23,417][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:46:23,921][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:46:24,426][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:46:24,937][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:46:25,445][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:46:25,950][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:46:26,470][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:46:26,977][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:46:27,482][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:46:27,990][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:46:28,492][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:46:28,998][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:46:29,506][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:46:30,012][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:46:30,524][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:46:31,029][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:46:31,538][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:46:32,048][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:46:32,552][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10002 tokens. [2025-11-13 09:46:33,452][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.22%, ΔTime: 00:00:33 [2025-11-13 09:46:34,202][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:46:34,204][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:46:34,205][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:46:35,118][__main__][INFO] - Iteration 712 took 1m 11s (48.24% Gen, 50.47% Train). Generation: 34s, Training: 35s. Estimated remaining time: 47h 35m 32s. Estimated total time: 59h 12m 26s. Time estimates for 10 more iterations: 11m 50s, 100 more iterations: 1h 58m 24s, 500 more iterations: 9h 52m 4s. [2025-11-13 09:46:35,120][__main__][INFO] - Starting iteration 712. [2025-11-13 09:46:35,609][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 71 and human policies 1. [2025-11-13 09:46:35,609][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:47:03,913][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls Given our values, it seems both Alice and Bob have a high value for balls. 
However, proposing to take all balls might not be the best strategy, as it could lead to a scenario where the other player also proposes to take all balls, resulting in no items for either player. A cooperative or competitive approach where we each take a share might be more beneficial in the long run. In this case, proposing to take none of the balls maintains a balanced approach. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:47:05,135][__main__][INFO] - Number of regex retries in iteration 712: 1 [2025-11-13 09:47:05,136][__main__][INFO] - agents played in iteration 712 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:47:05,922][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:05,945][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:05,967][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:05,989][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:47:05,990][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:47:05,991][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:47:06,827][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:47:07,289][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:47:07,794][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:47:08,302][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:47:08,808][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:47:09,317][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:47:09,817][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:47:10,322][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:47:10,826][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:47:11,329][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:47:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:47:12,337][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:47:12,838][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:47:13,347][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:47:13,852][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:47:14,363][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:47:14,872][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:47:15,377][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:47:15,897][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:47:16,403][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:47:16,927][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:47:17,436][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:47:17,943][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:47:18,458][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:47:18,961][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:47:19,469][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:47:19,983][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:47:20,481][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:47:20,978][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:47:21,476][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:47:21,978][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:47:22,497][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:47:22,995][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:47:23,498][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:47:24,005][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:47:24,509][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:47:25,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:47:25,526][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:47:26,029][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:47:26,534][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:47:27,032][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:47:27,535][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:47:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:47:28,543][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:47:29,049][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:47:29,555][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:47:30,064][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:47:30,570][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:47:31,076][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:47:31,594][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:47:32,095][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:47:32,600][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:47:33,104][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:47:33,604][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:47:34,108][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:47:34,601][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:47:35,103][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:47:35,605][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:47:36,106][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:47:36,609][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:47:37,109][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:47:37,610][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:47:38,112][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:47:38,612][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:47:39,114][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9811 tokens. [2025-11-13 09:47:39,957][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.11%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 61.98%, ΔTime: 00:00:33 [2025-11-13 09:47:40,709][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:47:40,710][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:47:40,712][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:47:41,668][__main__][INFO] - Iteration 713 took 1m 6s (44.70% Gen, 53.85% Train). Generation: 29s, Training: 35s. Estimated remaining time: 43h 25m 0s. Estimated total time: 55h 3m 0s. Time estimates for 10 more iterations: 11m 0s, 100 more iterations: 1h 50m 6s, 500 more iterations: 9h 10m 30s. [2025-11-13 09:47:41,670][__main__][INFO] - Starting iteration 713. [2025-11-13 09:47:42,156][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 71 and human policies 1. 
[2025-11-13 09:47:42,156][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:48:03,096][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:48:15,079][__main__][INFO] - Number of regex retries in iteration 713: 1 [2025-11-13 09:48:15,079][__main__][INFO] - agents played in iteration 713 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:48:15,951][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:15,984][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:16,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:16,036][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:48:16,037][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:48:16,037][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:48:16,925][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:48:17,395][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:48:17,904][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:48:18,414][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:48:18,922][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:48:19,430][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:48:19,945][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:48:20,450][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:48:20,956][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:48:21,469][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:48:21,980][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:48:22,495][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:48:22,996][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:48:23,497][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:48:24,002][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:48:24,511][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:48:25,010][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:48:25,517][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:48:26,022][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:48:26,528][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:48:27,033][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:48:27,540][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:48:28,048][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:48:28,555][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:48:29,062][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:48:29,569][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:48:30,075][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:48:30,583][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:48:31,091][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:48:31,597][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:48:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:48:32,600][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:48:33,113][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:48:33,615][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:48:34,117][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:48:34,628][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:48:35,132][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:48:35,633][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:48:36,134][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:48:36,633][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:48:37,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:48:37,648][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:48:38,151][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:48:38,649][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:48:39,152][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:48:39,651][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:48:40,158][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:48:40,655][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:48:41,170][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:48:41,674][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:48:42,193][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:48:42,696][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:48:43,203][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:48:43,718][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:48:44,223][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:48:44,725][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:48:45,226][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:48:45,728][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:48:46,231][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:48:46,732][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:48:47,232][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:48:47,733][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:48:48,239][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:48:48,742][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:48:49,243][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9864 tokens. [2025-11-13 09:48:50,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.14%, ΔTime: 00:00:33 [2025-11-13 09:48:50,792][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:48:50,794][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:48:50,796][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:48:51,660][__main__][INFO] - Iteration 714 took 1m 9s (47.37% Gen, 51.39% Train). Generation: 32s, Training: 35s. Estimated remaining time: 46h 16m 4s. Estimated total time: 57h 55m 14s. Time estimates for 10 more iterations: 11m 35s, 100 more iterations: 1h 55m 50s, 500 more iterations: 9h 39m 12s. [2025-11-13 09:48:51,662][__main__][INFO] - Starting iteration 714. [2025-11-13 09:48:52,146][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 71 and human policies 1. 
[2025-11-13 09:48:52,147][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:49:24,804][__main__][INFO] - Number of regex retries in iteration 714: 0 [2025-11-13 09:49:24,805][__main__][INFO] - agents played in iteration 714 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:49:25,681][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:25,710][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:25,736][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:25,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:49:25,761][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:49:25,761][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:49:26,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:49:27,127][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:49:27,637][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:49:28,151][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:49:28,662][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:49:29,169][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:49:29,680][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:49:30,187][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:49:30,712][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:49:31,217][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:49:31,723][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:49:32,233][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:49:32,748][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:49:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:49:33,765][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:49:34,272][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:49:34,780][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:49:35,287][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:49:35,791][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:49:36,300][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:49:36,805][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:49:37,315][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:49:37,826][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:49:38,330][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:49:38,841][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:49:39,346][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:49:39,852][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:49:40,357][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:49:40,862][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:49:41,383][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:49:41,882][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:49:42,396][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:49:42,900][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:49:43,408][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:49:43,917][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:49:44,416][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:49:44,917][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:49:45,420][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:49:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:49:46,434][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:49:46,933][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:49:47,440][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:49:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:49:48,449][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:49:48,956][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:49:49,461][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:49:49,964][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:49:50,467][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:49:50,970][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:49:51,477][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:49:51,978][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:49:52,480][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:49:52,986][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:49:53,492][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:49:53,998][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:49:54,501][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:49:55,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:49:55,518][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:49:56,024][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:49:56,526][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:49:57,030][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:49:57,531][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:49:58,033][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:49:58,545][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:49:59,048][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10030 tokens. [2025-11-13 09:49:59,915][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.34%, Block Peak % of device VRAM: 62.17%, ΔTime: 00:00:33 [2025-11-13 09:50:00,666][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:50:00,669][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:50:00,671][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:50:01,623][__main__][INFO] - Iteration 715 took 1m 9s (47.00% Gen, 51.62% Train). Generation: 32s, Training: 35s. Estimated remaining time: 46h 13m 34s. Estimated total time: 57h 53m 54s. Time estimates for 10 more iterations: 11m 34s, 100 more iterations: 1h 55m 47s, 500 more iterations: 9h 38m 59s. [2025-11-13 09:50:01,625][__main__][INFO] - Starting iteration 715. [2025-11-13 09:50:02,127][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 71 and human policies 1. 
[2025-11-13 09:50:02,129][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:50:36,292][__main__][INFO] - Number of regex retries in iteration 715: 0 [2025-11-13 09:50:36,295][__main__][INFO] - agents played in iteration 715 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:50:37,178][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:37,202][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:37,229][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:37,252][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:50:37,253][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:50:37,253][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:50:38,205][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:50:38,675][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:50:39,186][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:50:39,687][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:50:40,195][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:50:40,702][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:50:41,211][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:50:41,732][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:50:42,245][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:50:42,751][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:50:43,257][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:50:43,764][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:50:44,277][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:50:44,781][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:50:45,290][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:50:45,796][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:50:46,301][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:50:46,808][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:50:47,315][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:50:47,824][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:50:48,331][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:50:48,842][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:50:49,349][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:50:49,853][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:50:50,359][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:50:50,870][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:50:51,376][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:50:51,882][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:50:52,400][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:50:52,905][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:50:53,426][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:50:53,932][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:50:54,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:50:54,949][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:50:55,456][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:50:55,964][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:50:56,462][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:50:56,958][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:50:57,461][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:50:57,964][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:50:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:50:58,961][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:50:59,466][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:50:59,968][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:51:00,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:51:00,972][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:51:01,478][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:51:01,980][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:51:02,486][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:51:02,993][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:51:03,502][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:51:04,008][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:51:04,512][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:51:05,024][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:51:05,530][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:51:06,037][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:51:06,544][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:51:07,050][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:51:07,556][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:51:08,054][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:51:08,552][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:51:09,060][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:51:09,568][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:51:10,067][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:51:10,563][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9993 tokens. [2025-11-13 09:51:11,410][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.93%, Current % of VRAM taken: 58.18%, Block Peak % of device VRAM: 62.08%, ΔTime: 00:00:33 [2025-11-13 09:51:12,105][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:51:12,106][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:51:12,108][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:51:13,004][__main__][INFO] - Iteration 716 took 1m 10s (48.20% Gen, 50.53% Train). Generation: 34s, Training: 35s. Estimated remaining time: 47h 22m 21s. Estimated total time: 59h 3m 52s. Time estimates for 10 more iterations: 11m 48s, 100 more iterations: 1h 58m 7s, 500 more iterations: 9h 50m 38s. [2025-11-13 09:51:13,006][__main__][INFO] - Starting iteration 716. [2025-11-13 09:51:13,508][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 71 and human policies 1. 
[2025-11-13 09:51:13,508][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:51:36,089][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 10 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:51:47,129][__main__][INFO] - Number of regex retries in iteration 716: 1 [2025-11-13 09:51:47,130][__main__][INFO] - agents played in iteration 716 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:51:47,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:48,012][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:48,047][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:48,071][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:51:48,071][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:51:48,072][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:51:49,006][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:51:49,467][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:51:49,972][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:51:50,482][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:51:51,004][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:51:51,508][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:51:52,008][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:51:52,506][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:51:53,006][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:51:53,525][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:51:54,057][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:51:54,562][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:51:55,069][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:51:55,575][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:51:56,083][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:51:56,588][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:51:57,099][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:51:57,614][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:51:58,122][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:51:58,630][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:51:59,144][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:51:59,651][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:52:00,152][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:52:00,660][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:52:01,168][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:52:01,673][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:52:02,175][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:52:02,679][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:52:03,184][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:52:03,687][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:52:04,196][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:52:04,702][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:52:05,204][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:52:05,707][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:52:06,209][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:52:06,727][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:52:07,234][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:52:07,737][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:52:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:52:08,742][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:52:09,250][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:52:09,761][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:52:10,268][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:52:10,775][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:52:11,282][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:52:11,785][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:52:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:52:12,790][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:52:13,305][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:52:13,805][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:52:14,307][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:52:14,810][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:52:15,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:52:15,822][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:52:16,323][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:52:16,825][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:52:17,336][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:52:17,839][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:52:18,342][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:52:18,842][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:52:19,346][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:52:19,858][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:52:20,361][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:52:20,863][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:52:21,367][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10062 tokens. [2025-11-13 09:52:22,200][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.19%, Current % of VRAM taken: 58.44%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 09:52:22,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:52:22,996][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:52:22,998][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:52:23,879][__main__][INFO] - Iteration 717 took 1m 10s (47.78% Gen, 50.97% Train). Generation: 33s, Training: 35s. Estimated remaining time: 46h 55m 52s. Estimated total time: 58h 38m 34s. Time estimates for 10 more iterations: 11m 43s, 100 more iterations: 1h 57m 17s, 500 more iterations: 9h 46m 25s. [2025-11-13 09:52:23,881][__main__][INFO] - Starting iteration 717. [2025-11-13 09:52:24,374][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 71 and human policies 1. 
[2025-11-13 09:52:24,374][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:52:45,646][__main__][INFO] - Number of regex retries in iteration 717: 0 [2025-11-13 09:52:45,646][__main__][INFO] - agents played in iteration 717 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:52:46,457][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:46,481][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:46,506][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:46,528][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:52:46,529][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:52:46,530][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:52:47,465][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:52:48,029][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:52:49,453][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:52:49,959][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:52:50,466][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:52:50,977][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:52:51,483][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:52:51,984][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:52:52,492][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:52:53,005][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:52:53,511][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:52:54,025][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:52:54,531][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:52:55,035][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:52:55,545][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:52:56,049][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:52:56,552][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:52:57,058][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:52:57,566][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:52:58,066][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:52:58,570][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:52:59,074][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:52:59,589][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:53:00,096][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:53:00,608][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:53:01,112][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:53:01,616][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:53:02,118][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:53:02,625][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:53:03,128][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:53:03,633][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:53:04,130][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:53:04,639][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:53:05,145][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:53:05,652][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:53:06,186][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:53:06,694][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:53:07,204][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:53:07,708][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:53:08,212][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:53:08,722][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:53:09,227][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:53:09,736][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:53:10,237][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:53:10,739][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:53:11,243][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:53:11,749][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:53:12,255][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:53:12,762][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:53:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:53:13,761][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:53:14,258][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:53:14,758][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:53:15,254][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:53:15,754][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:53:16,249][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:53:16,741][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:53:17,236][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:53:17,740][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:53:18,246][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:53:18,751][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:53:19,250][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:53:19,751][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:53:20,255][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:53:20,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9919 tokens. [2025-11-13 09:53:21,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:34 [2025-11-13 09:53:22,244][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:53:22,246][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:53:22,248][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:53:23,095][__main__][INFO] - Iteration 718 took 58s (36.23% Gen, 62.33% Train). Generation: 21s, Training: 36s. Estimated remaining time: 37h 12m 24s. Estimated total time: 48h 56m 5s. Time estimates for 10 more iterations: 9m 47s, 100 more iterations: 1h 37m 52s, 500 more iterations: 8h 9m 20s. [2025-11-13 09:53:23,098][__main__][INFO] - Starting iteration 718. [2025-11-13 09:53:23,582][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 71 and human policies 1. 
[2025-11-13 09:53:23,583][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:53:57,166][__main__][INFO] - Number of regex retries in iteration 718: 0 [2025-11-13 09:53:57,167][__main__][INFO] - agents played in iteration 718 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:53:57,982][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:58,010][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:58,033][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:58,056][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:53:58,057][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:53:58,058][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:53:58,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:53:59,442][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:53:59,961][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:54:00,463][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:54:00,969][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:54:01,478][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:54:01,986][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:54:02,494][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:54:03,001][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:54:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:54:04,017][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:54:04,526][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:54:05,039][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:54:05,546][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:54:06,058][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:54:06,567][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:54:07,072][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:54:07,576][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:54:08,085][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:54:08,589][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:54:09,085][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:54:09,586][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:54:10,082][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:54:10,583][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:54:11,086][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:54:11,587][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:54:12,091][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:54:12,592][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:54:13,100][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:54:13,605][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:54:14,107][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:54:14,612][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:54:15,116][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:54:15,617][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:54:16,118][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:54:16,621][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:54:17,124][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:54:17,628][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:54:18,131][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:54:18,637][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:54:19,140][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:54:19,642][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:54:20,139][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:54:20,643][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:54:21,152][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:54:21,656][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:54:22,164][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:54:22,665][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:54:23,173][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:54:23,670][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:54:24,170][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:54:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:54:25,167][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:54:25,692][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:54:26,193][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:54:26,697][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:54:27,202][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:54:27,729][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:54:28,235][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:54:28,736][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:54:29,241][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:54:29,738][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:54:30,250][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:54:30,749][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:54:31,251][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10157 tokens. [2025-11-13 09:54:32,091][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:33 [2025-11-13 09:54:32,853][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:54:32,856][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:54:32,858][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:54:33,860][__main__][INFO] - Iteration 719 took 1m 10s (47.79% Gen, 50.78% Train). Generation: 33s, Training: 35s. Estimated remaining time: 46h 49m 3s. Estimated total time: 58h 33m 55s. Time estimates for 10 more iterations: 11m 42s, 100 more iterations: 1h 57m 7s, 500 more iterations: 9h 45m 39s. [2025-11-13 09:54:33,862][__main__][INFO] - Starting iteration 719. [2025-11-13 09:54:34,360][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 71 and human policies 1. 
[2025-11-13 09:54:34,361][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:55:05,428][__main__][INFO] - Number of regex retries in iteration 719: 0 [2025-11-13 09:55:05,430][__main__][INFO] - agents played in iteration 719 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:55:06,296][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:06,330][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:06,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:06,385][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.34%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:55:06,386][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:55:06,387][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:55:07,340][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:55:07,900][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:55:08,416][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:55:08,926][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:55:09,435][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:55:09,940][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:55:10,447][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:55:10,964][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:55:11,480][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:55:11,992][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:55:12,504][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:55:13,014][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:55:13,539][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:55:14,050][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:55:14,559][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:55:15,069][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:55:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:55:16,083][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:55:16,592][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:55:17,100][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:55:17,615][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:55:18,121][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:55:18,628][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:55:19,134][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:55:19,639][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:55:20,145][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:55:20,647][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:55:21,149][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:55:21,645][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:55:22,149][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:55:22,660][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:55:23,161][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:55:23,663][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:55:24,165][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:55:24,666][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:55:25,169][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:55:25,670][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:55:26,173][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:55:26,676][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:55:27,177][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:55:27,682][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:55:28,181][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:55:28,686][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:55:29,187][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:55:29,683][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:55:30,186][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:55:30,690][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:55:31,193][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:55:31,696][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:55:32,195][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:55:32,687][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:55:33,187][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:55:33,686][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:55:34,192][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:55:34,689][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:55:35,188][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:55:35,690][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:55:36,193][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:55:36,693][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:55:37,193][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:55:37,689][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:55:38,191][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:55:38,687][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:55:39,184][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:55:39,687][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10041 tokens. [2025-11-13 09:55:40,554][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.32%, ΔTime: 00:00:33 [2025-11-13 09:55:41,196][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:55:41,198][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:55:41,199][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:55:42,343][__main__][INFO] - Iteration 720 took 1m 7s (45.70% Gen, 52.61% Train). Generation: 31s, Training: 35s. Estimated remaining time: 44h 53m 9s. Estimated total time: 56h 39m 9s. Time estimates for 10 more iterations: 11m 19s, 100 more iterations: 1h 53m 18s, 500 more iterations: 9h 26m 31s. [2025-11-13 09:55:42,345][__main__][INFO] - Starting iteration 720. [2025-11-13 09:55:42,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 71 and human policies 1. 
[2025-11-13 09:55:42,855][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:56:10,357][__main__][INFO] - Number of regex retries in iteration 720: 0 [2025-11-13 09:56:10,357][__main__][INFO] - agents played in iteration 720 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:56:11,160][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:11,185][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:11,209][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:11,233][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.37%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:56:11,233][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:56:11,234][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:56:12,160][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:56:12,620][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:56:13,127][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:56:13,646][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:56:14,157][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:56:14,685][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:56:15,188][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:56:15,698][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:56:16,199][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:56:16,706][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:56:17,207][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:56:17,708][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:56:18,214][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:56:18,719][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:56:19,225][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:56:19,729][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:56:20,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:56:20,737][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:56:21,260][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:56:21,763][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:56:22,264][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:56:22,767][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:56:23,272][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:56:23,777][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:56:24,278][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:56:24,784][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:56:25,286][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:56:25,790][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:56:26,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:56:26,787][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:56:27,287][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:56:27,790][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:56:28,289][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:56:28,790][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:56:29,292][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:56:29,790][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:56:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:56:30,793][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:56:31,291][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:56:31,796][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:56:32,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:56:32,805][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:56:33,306][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:56:33,813][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:56:34,316][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:56:34,822][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:56:35,316][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:56:35,818][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:56:36,323][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:56:36,823][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:56:37,321][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:56:37,822][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:56:38,330][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:56:38,833][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:56:39,327][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:56:39,823][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:56:40,325][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:56:40,835][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:56:41,336][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:56:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:56:42,336][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:56:42,840][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:56:43,354][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:56:43,856][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:56:44,356][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9964 tokens. [2025-11-13 09:56:45,220][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.10%, Current % of VRAM taken: 58.35%, Block Peak % of device VRAM: 62.16%, ΔTime: 00:00:33 [2025-11-13 09:56:46,157][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:56:46,159][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:56:46,173][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:56:48,554][__main__][INFO] - Iteration 721 took 1m 5s (41.86% Gen, 54.52% Train). Generation: 27s, Training: 35s. Estimated remaining time: 42h 57m 51s. Estimated total time: 54h 44m 58s. Time estimates for 10 more iterations: 10m 56s, 100 more iterations: 1h 49m 29s, 500 more iterations: 9h 7m 29s. [2025-11-13 09:56:48,556][__main__][INFO] - Starting iteration 721. [2025-11-13 09:56:49,046][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 72 and human policies 1. 
[2025-11-13 09:56:49,047][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:57:11,435][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 10 books, 0 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 09:57:23,769][__main__][INFO] - Number of regex retries in iteration 721: 1 [2025-11-13 09:57:23,771][__main__][INFO] - agents played in iteration 721 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:57:24,666][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:24,708][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:24,741][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:24,768][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:57:24,768][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:57:24,769][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:57:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:57:26,106][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:57:26,617][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:57:27,123][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:57:27,627][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:57:28,132][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:57:28,636][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:57:29,145][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:57:29,653][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:57:30,160][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:57:30,664][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:57:31,167][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:57:31,668][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:57:32,184][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:57:32,686][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:57:33,189][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:57:33,694][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:57:34,199][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:57:34,703][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:57:35,207][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:57:35,711][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:57:36,218][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:57:36,786][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:57:37,290][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:57:37,797][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:57:38,291][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:57:38,799][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:57:39,301][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:57:39,798][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:57:40,296][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:57:40,796][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:57:41,291][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:57:41,795][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:57:42,298][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:57:42,797][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:57:43,290][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:57:43,789][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:57:44,289][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:57:44,787][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:57:45,294][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:57:45,793][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:57:46,292][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:57:46,796][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:57:47,294][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:57:47,794][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:57:48,283][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:57:48,782][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:57:49,288][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:57:49,787][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:57:50,287][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:57:50,788][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:57:51,291][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:57:51,791][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:57:52,291][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:57:52,795][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:57:53,300][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:57:53,803][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:57:54,306][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:57:54,805][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:57:55,312][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:57:55,817][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:57:56,324][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:57:56,829][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:57:57,336][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:57:57,835][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10017 tokens. [2025-11-13 09:57:58,692][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.29%, ΔTime: 00:00:33 [2025-11-13 09:57:59,346][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:57:59,348][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:57:59,350][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:58:00,229][__main__][INFO] - Iteration 722 took 1m 11s (48.78% Gen, 49.98% Train). Generation: 34s, Training: 35s. Estimated remaining time: 47h 30m 51s. Estimated total time: 59h 19m 9s. Time estimates for 10 more iterations: 11m 51s, 100 more iterations: 1h 58m 38s, 500 more iterations: 9h 53m 11s. [2025-11-13 09:58:00,231][__main__][INFO] - Starting iteration 722. [2025-11-13 09:58:00,699][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 72 and human policies 1. 
[2025-11-13 09:58:00,700][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:58:32,128][__main__][INFO] - Number of regex retries in iteration 722: 0 [2025-11-13 09:58:32,129][__main__][INFO] - agents played in iteration 722 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:58:32,987][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:33,011][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:33,037][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:33,060][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:58:33,060][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:58:33,061][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:58:33,990][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:58:34,450][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:58:34,966][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:58:35,486][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:58:35,990][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:58:36,494][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:58:37,005][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:58:37,510][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:58:38,012][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:58:38,513][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:58:39,024][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:58:39,529][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:58:40,034][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:58:40,534][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:58:41,044][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:58:41,549][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:58:42,052][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:58:42,556][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:58:43,059][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:58:43,565][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:58:44,069][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:58:44,572][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:58:45,069][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:58:45,569][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:58:46,086][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:58:46,581][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:58:47,084][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:58:47,597][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:58:48,096][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:58:48,599][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:58:49,100][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:58:49,593][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:58:50,098][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:58:50,601][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:58:51,102][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:58:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:58:52,109][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:58:52,610][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:58:53,112][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:58:53,615][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:58:54,117][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:58:54,617][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:58:55,121][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:58:55,625][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 09:58:56,125][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 09:58:56,629][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 09:58:57,132][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 09:58:57,631][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 09:58:58,135][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 09:58:58,635][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 09:58:59,139][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 09:58:59,641][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 09:59:00,144][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 09:59:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 09:59:01,147][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 09:59:01,647][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 09:59:02,150][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 09:59:02,652][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 09:59:03,145][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 09:59:03,643][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 09:59:04,137][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 09:59:04,634][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 09:59:05,126][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
09:59:05,623][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 09:59:06,123][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10124 tokens. [2025-11-13 09:59:07,051][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.96%, Current % of VRAM taken: 58.21%, Block Peak % of device VRAM: 62.18%, ΔTime: 00:00:33 [2025-11-13 09:59:07,914][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 09:59:07,916][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 09:59:07,918][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 09:59:08,931][__main__][INFO] - Iteration 723 took 1m 8s (46.06% Gen, 52.45% Train). Generation: 31s, Training: 35s. Estimated remaining time: 45h 2m 8s. Estimated total time: 56h 51m 36s. Time estimates for 10 more iterations: 11m 22s, 100 more iterations: 1h 53m 43s, 500 more iterations: 9h 28m 36s. [2025-11-13 09:59:08,933][__main__][INFO] - Starting iteration 723. [2025-11-13 09:59:09,446][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 72 and human policies 1. 
[2025-11-13 09:59:09,447][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 09:59:36,525][__main__][INFO] - Number of regex retries in iteration 723: 0 [2025-11-13 09:59:36,526][__main__][INFO] - agents played in iteration 723 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 09:59:37,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:37,365][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:37,393][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:37,430][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.28%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 09:59:37,431][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 09:59:37,432][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 09:59:38,267][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 09:59:38,723][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 09:59:39,231][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 09:59:39,736][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 09:59:40,244][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 09:59:40,745][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 09:59:41,248][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 09:59:41,752][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 09:59:42,254][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 09:59:42,758][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 09:59:43,254][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 09:59:43,750][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 09:59:44,250][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 09:59:44,748][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 09:59:45,244][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 09:59:45,738][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 09:59:46,236][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 09:59:46,741][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 09:59:47,237][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 09:59:47,730][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 09:59:48,232][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
09:59:48,730][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 09:59:49,226][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 09:59:49,720][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 09:59:50,219][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 09:59:50,720][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 09:59:51,227][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 09:59:51,724][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 09:59:52,218][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 09:59:52,717][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 09:59:53,223][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 09:59:53,724][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 09:59:54,226][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 09:59:54,731][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 09:59:55,234][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 09:59:55,736][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 09:59:56,245][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 09:59:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 09:59:57,250][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 09:59:57,748][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 09:59:58,277][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 09:59:58,785][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
09:59:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 09:59:59,796][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 10:00:00,295][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 10:00:00,798][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 10:00:01,306][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 10:00:01,805][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 10:00:02,307][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 10:00:02,806][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 10:00:03,303][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 10:00:03,802][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 10:00:04,306][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 10:00:04,806][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 10:00:05,315][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 10:00:05,812][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 10:00:06,316][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 10:00:06,814][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 10:00:07,316][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 10:00:07,820][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 10:00:08,322][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 10:00:08,825][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 10:00:09,333][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
10:00:09,840][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 10:00:10,345][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9951 tokens. [2025-11-13 10:00:11,257][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.22%, Current % of VRAM taken: 58.46%, Block Peak % of device VRAM: 62.37%, ΔTime: 00:00:32 [2025-11-13 10:00:13,967][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:00:13,969][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:00:13,971][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:00:14,852][__main__][INFO] - Iteration 724 took 1m 5s (41.40% Gen, 57.25% Train). Generation: 27s, Training: 37s. Estimated remaining time: 42h 39m 45s. Estimated total time: 54h 30m 18s. Time estimates for 10 more iterations: 10m 54s, 100 more iterations: 1h 49m 0s, 500 more iterations: 9h 5m 3s. [2025-11-13 10:00:14,854][__main__][INFO] - Starting iteration 724. [2025-11-13 10:00:15,344][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 72 and human policies 1. 
[2025-11-13 10:00:15,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:00:44,218][__main__][INFO] - Number of regex retries in iteration 724: 0 [2025-11-13 10:00:44,218][__main__][INFO] - agents played in iteration 724 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 10:00:45,101][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:45,150][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:45,188][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:45,220][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.41%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:00:45,221][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:00:45,222][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:00:46,128][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 10:00:46,590][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 10:00:47,098][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 10:00:47,600][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 10:00:48,106][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 10:00:48,625][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 10:00:49,127][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 10:00:49,633][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 10:00:50,140][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 10:00:50,643][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 10:00:51,144][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 10:00:51,644][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 10:00:52,147][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 10:00:52,652][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 10:00:53,154][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 10:00:53,657][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 10:00:54,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 10:00:54,656][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 10:00:55,158][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 10:00:55,661][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 10:00:56,163][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
10:00:56,664][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 10:00:57,164][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 10:00:57,666][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 10:00:58,171][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 10:00:58,675][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 10:00:59,189][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 10:00:59,690][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 10:01:00,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 10:01:00,707][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 10:01:01,209][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 10:01:01,725][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 10:01:02,228][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 10:01:02,735][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 10:01:03,238][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 10:01:03,743][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 10:01:04,249][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 10:01:04,751][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 10:01:05,252][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 10:01:05,759][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 10:01:06,261][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 10:01:06,764][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
10:01:07,259][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 10:01:07,758][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 10:01:08,259][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 10:01:08,760][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 10:01:09,260][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 10:01:09,764][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 10:01:10,266][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 10:01:10,765][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 10:01:11,267][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 10:01:11,766][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 10:01:12,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 10:01:12,770][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 10:01:13,271][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 10:01:13,775][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 10:01:14,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 10:01:14,771][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 10:01:15,264][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 10:01:15,760][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 10:01:16,259][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 10:01:16,763][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 10:01:17,259][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
10:01:17,755][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 10:01:18,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10126 tokens. [2025-11-13 10:01:19,181][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.21%, Current % of VRAM taken: 58.45%, Block Peak % of device VRAM: 62.47%, ΔTime: 00:00:33 [2025-11-13 10:01:19,963][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:01:19,965][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:01:19,966][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:01:20,927][__main__][INFO] - Iteration 725 took 1m 5s (44.03% Gen, 54.51% Train). Generation: 28s, Training: 35s. Estimated remaining time: 42h 47m 33s. Estimated total time: 54h 39m 12s. Time estimates for 10 more iterations: 10m 55s, 100 more iterations: 1h 49m 18s, 500 more iterations: 9h 6m 32s. [2025-11-13 10:01:20,930][__main__][INFO] - Starting iteration 725. [2025-11-13 10:01:21,416][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 72 and human policies 1. 
[2025-11-13 10:01:21,417][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:01:43,421][__main__][INFO] - Number of regex retries in iteration 725: 0 [2025-11-13 10:01:43,422][__main__][INFO] - agents played in iteration 725 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 10:01:44,282][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:44,309][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:44,336][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:44,360][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.36%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:01:44,361][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:01:44,362][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:01:45,244][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 10:01:45,707][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 10:01:46,216][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 10:01:46,716][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 10:01:47,226][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 10:01:47,736][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 10:01:48,238][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 10:01:48,736][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 10:01:49,239][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 10:01:49,744][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 10:01:50,251][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 10:01:50,758][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 10:01:51,266][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 10:01:51,771][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 10:01:52,274][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 10:01:52,777][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 10:01:53,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 10:01:53,785][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 10:01:54,295][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 10:01:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 10:01:55,308][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
10:01:55,813][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 10:01:56,309][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 10:01:56,814][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 10:01:57,316][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 10:01:57,820][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 10:01:58,336][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 10:01:58,839][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 10:01:59,347][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 10:01:59,846][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 10:02:00,348][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 10:02:00,846][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 10:02:01,346][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 10:02:01,840][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 10:02:02,343][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 10:02:02,837][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 10:02:03,337][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 10:02:03,833][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 10:02:04,335][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 10:02:04,836][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 10:02:05,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 10:02:05,829][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
10:02:06,332][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 10:02:06,836][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 10:02:07,341][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 10:02:07,845][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 10:02:08,348][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 10:02:08,852][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 10:02:09,356][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 10:02:09,859][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 10:02:10,357][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 10:02:10,856][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 10:02:11,354][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 10:02:11,848][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 10:02:12,343][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 10:02:12,845][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 10:02:13,347][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 10:02:13,853][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 10:02:14,354][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 10:02:14,855][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 10:02:15,358][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 10:02:15,861][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 10:02:16,364][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
10:02:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 10:02:17,366][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9919 tokens. [2025-11-13 10:02:18,215][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.12%, Current % of VRAM taken: 58.36%, Block Peak % of device VRAM: 62.33%, ΔTime: 00:00:32 [2025-11-13 10:02:18,997][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:02:18,999][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:02:19,000][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:02:20,031][__main__][INFO] - Iteration 726 took 58s (37.54% Gen, 60.70% Train). Generation: 22s, Training: 35s. Estimated remaining time: 36h 58m 7s. Estimated total time: 48h 50m 45s. Time estimates for 10 more iterations: 9m 46s, 100 more iterations: 1h 37m 41s, 500 more iterations: 8h 8m 27s. [2025-11-13 10:02:20,033][__main__][INFO] - Starting iteration 726. [2025-11-13 10:02:20,521][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 72 and human policies 1. 
[2025-11-13 10:02:20,521][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:02:43,842][__main__][INFO] - Number of regex retries in iteration 726: 0 [2025-11-13 10:02:43,843][__main__][INFO] - agents played in iteration 726 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 10:02:44,704][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:44,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:44,750][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:44,773][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.31%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:02:44,774][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:02:44,775][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:02:45,693][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 10:02:46,157][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 10:02:46,672][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 10:02:47,177][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 10:02:47,685][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 10:02:48,201][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 10:02:48,707][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 10:02:49,215][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 10:02:49,724][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 10:02:50,228][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 10:02:50,747][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 10:02:51,251][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 10:02:51,756][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 10:02:52,267][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 10:02:52,773][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 10:02:53,282][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 10:02:53,792][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 10:02:54,290][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 10:02:54,789][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 10:02:55,296][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 10:02:55,802][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
10:02:56,310][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 10:02:56,814][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 10:02:57,320][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 10:02:57,821][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 10:02:58,323][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 10:02:58,828][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 10:02:59,330][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 10:02:59,833][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 10:03:00,336][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 10:03:00,842][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 10:03:01,357][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 10:03:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 10:03:02,355][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 10:03:02,854][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 10:03:03,349][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 10:03:03,856][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 10:03:04,358][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 10:03:04,859][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 10:03:05,361][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 10:03:05,866][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 10:03:06,367][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
10:03:06,868][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 10:03:07,371][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 10:03:07,874][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 10:03:08,378][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 10:03:08,877][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 10:03:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 10:03:09,888][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 10:03:10,393][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 10:03:10,893][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 10:03:11,393][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 10:03:11,897][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 10:03:12,401][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 10:03:12,902][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 10:03:13,405][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 10:03:13,905][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 10:03:14,411][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 10:03:14,909][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 10:03:15,408][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 10:03:15,917][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 10:03:16,422][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 10:03:16,924][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
10:03:17,426][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 10:03:17,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10047 tokens. [2025-11-13 10:03:18,762][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.08%, Current % of VRAM taken: 58.33%, Block Peak % of device VRAM: 62.19%, ΔTime: 00:00:33 [2025-11-13 10:03:19,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:03:19,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:03:19,513][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:03:20,426][__main__][INFO] - Iteration 727 took 59s (38.93% Gen, 59.54% Train). Generation: 23s, Training: 35s. Estimated remaining time: 38h 1m 38s. Estimated total time: 49h 55m 17s. Time estimates for 10 more iterations: 9m 59s, 100 more iterations: 1h 39m 50s, 500 more iterations: 8h 19m 12s. [2025-11-13 10:03:20,428][__main__][INFO] - Starting iteration 727. [2025-11-13 10:03:20,897][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 72 and human policies 1. 
[2025-11-13 10:03:20,897][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:03:35,949][mllm.models.large_language_model_local][WARNING] - Response Proposal: 10 hats, 0 books, 10 balls did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 10:03:46,936][__main__][INFO] - Number of regex retries in iteration 727: 1 [2025-11-13 10:03:46,937][__main__][INFO] - agents played in iteration 727 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 10:03:47,737][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:47,760][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:47,783][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:47,806][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.32%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:03:47,807][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:03:47,807][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:03:48,739][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 10:03:49,201][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 10:03:49,709][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 10:03:50,212][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 10:03:50,716][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 10:03:51,217][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 10:03:51,723][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 10:03:52,229][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 10:03:52,737][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 10:03:53,245][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 10:03:53,758][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 10:03:54,266][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 10:03:54,772][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 10:03:55,280][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 10:03:55,786][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 10:03:56,291][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 10:03:56,799][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 10:03:57,310][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 10:03:57,823][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 10:03:58,353][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 10:03:58,868][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
10:03:59,373][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 10:03:59,884][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 10:04:00,392][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 10:04:00,900][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 10:04:01,404][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 10:04:01,914][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 10:04:02,415][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 10:04:02,918][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 10:04:03,419][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 10:04:03,923][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 10:04:04,425][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 10:04:04,927][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 10:04:05,428][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 10:04:05,923][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 10:04:06,427][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 10:04:06,931][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 10:04:07,431][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 10:04:07,928][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 10:04:08,440][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 10:04:08,942][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 10:04:09,454][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
10:04:09,953][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 10:04:10,457][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 10:04:10,960][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 10:04:11,464][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 10:04:11,968][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 10:04:12,469][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 10:04:12,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 10:04:13,476][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 10:04:13,982][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 10:04:14,486][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 10:04:14,982][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 10:04:15,475][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 10:04:15,968][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 10:04:16,464][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 10:04:16,967][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 10:04:17,472][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 10:04:17,974][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 10:04:18,475][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 10:04:18,977][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 10:04:19,477][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 10:04:19,979][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
10:04:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 10:04:20,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9956 tokens. [2025-11-13 10:04:21,808][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.14%, Current % of VRAM taken: 58.38%, Block Peak % of device VRAM: 62.38%, ΔTime: 00:00:33 [2025-11-13 10:04:22,567][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:04:22,569][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:04:22,571][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:04:23,475][__main__][INFO] - Iteration 728 took 1m 2s (41.61% Gen, 56.94% Train). Generation: 26s, Training: 35s. Estimated remaining time: 40h 14m 16s. Estimated total time: 52h 8m 58s. Time estimates for 10 more iterations: 10m 25s, 100 more iterations: 1h 44m 17s, 500 more iterations: 8h 41m 29s. [2025-11-13 10:04:23,477][__main__][INFO] - Starting iteration 728. [2025-11-13 10:04:23,978][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 72 and human policies 1. 
[2025-11-13 10:04:23,978][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:04:52,665][__main__][INFO] - Number of regex retries in iteration 728: 0 [2025-11-13 10:04:52,667][__main__][INFO] - agents played in iteration 728 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 10:04:53,567][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:53,606][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:53,634][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:53,658][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.35%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:04:53,658][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:04:53,659][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:04:54,612][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 10:04:55,080][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 10:04:55,593][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 10:04:56,100][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 10:04:56,608][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 10:04:57,115][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 10:04:57,621][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 10:04:58,128][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 10:04:58,636][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 10:04:59,142][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 10:04:59,661][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 10:05:00,162][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 10:05:00,667][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 10:05:01,176][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 10:05:01,680][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 10:05:02,195][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 10:05:02,701][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 10:05:03,209][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 10:05:03,708][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 10:05:04,204][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 10:05:04,710][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
10:05:05,211][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 10:05:05,713][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 10:05:06,216][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 10:05:06,722][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 10:05:07,235][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 10:05:07,739][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 10:05:08,250][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 10:05:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 10:05:09,277][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 10:05:09,781][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 10:05:10,287][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 10:05:10,790][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 10:05:11,294][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 10:05:11,791][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 10:05:12,286][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 10:05:12,794][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 10:05:13,289][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 10:05:13,794][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 10:05:14,291][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 10:05:14,785][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 10:05:15,296][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
10:05:15,797][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 10:05:16,299][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 10:05:16,797][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 10:05:17,299][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 10:05:17,803][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 10:05:18,305][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 10:05:18,807][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 10:05:19,312][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 10:05:19,816][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 10:05:20,320][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 10:05:20,825][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 10:05:21,324][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 10:05:21,827][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 10:05:22,331][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 10:05:22,833][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 10:05:23,333][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 10:05:23,834][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 10:05:24,338][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 10:05:24,839][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 10:05:25,343][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 10:05:25,859][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
10:05:26,364][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 10:05:26,882][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 10009 tokens. [2025-11-13 10:05:27,734][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.25%, Current % of VRAM taken: 58.49%, Block Peak % of device VRAM: 62.21%, ΔTime: 00:00:33 [2025-11-13 10:05:28,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:05:28,388][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:05:28,389][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:05:29,196][__main__][INFO] - Iteration 729 took 1m 5s (43.99% Gen, 54.77% Train). Generation: 28s, Training: 35s. Estimated remaining time: 42h 25m 9s. Estimated total time: 54h 20m 56s. Time estimates for 10 more iterations: 10m 52s, 100 more iterations: 1h 48m 41s, 500 more iterations: 9h 3m 29s. [2025-11-13 10:05:29,198][__main__][INFO] - Starting iteration 729. [2025-11-13 10:05:29,685][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 72 and human policies 1. 
[2025-11-13 10:05:29,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:06:03,125][__main__][INFO] - Number of regex retries in iteration 729: 0 [2025-11-13 10:06:03,126][__main__][INFO] - agents played in iteration 729 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 10:06:03,980][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:04,004][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:04,029][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:04,052][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.38%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:06:04,053][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:06:04,054][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:06:04,959][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 10:06:05,424][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 10:06:05,940][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 10:06:06,451][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 10:06:06,960][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 10:06:07,467][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 10:06:07,977][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 10:06:08,478][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 10:06:08,997][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 10:06:09,502][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 10:06:10,008][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 10:06:10,517][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 10:06:11,017][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 10:06:11,522][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 10:06:12,027][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 10:06:12,536][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 10:06:13,043][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 10:06:13,547][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 10:06:14,051][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 10:06:14,550][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 10:06:15,052][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
10:06:15,554][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 10:06:16,058][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 10:06:16,557][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 10:06:17,072][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 10:06:17,573][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 10:06:18,074][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 10:06:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 10:06:19,078][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 10:06:19,598][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 10:06:20,104][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 10:06:20,606][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 10:06:21,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 10:06:21,608][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 10:06:22,117][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 10:06:22,619][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 10:06:23,123][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 10:06:23,629][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 10:06:24,133][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 10:06:24,638][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 10:06:25,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 10:06:25,648][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
10:06:26,150][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 10:06:26,648][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 10:06:27,149][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 10:06:27,649][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 10:06:28,149][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 10:06:28,650][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 10:06:29,150][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 10:06:29,649][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 10:06:30,153][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 10:06:30,650][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 10:06:31,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 10:06:31,643][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 10:06:32,143][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 10:06:32,645][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 10:06:33,145][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 10:06:33,638][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 10:06:34,129][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 10:06:34,619][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 10:06:35,116][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 10:06:35,606][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 10:06:36,100][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
10:06:36,592][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 10:06:37,084][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9939 tokens. [2025-11-13 10:06:37,938][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 11.90%, Current % of VRAM taken: 56.15%, Block Peak % of device VRAM: 62.25%, ΔTime: 00:00:32 [2025-11-13 10:06:38,662][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:06:38,664][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:06:38,665][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:06:39,576][__main__][INFO] - Iteration 730 took 1m 9s (47.84% Gen, 50.85% Train). Generation: 33s, Training: 35s. Estimated remaining time: 46h 17m 38s. Estimated total time: 58h 14m 36s. Time estimates for 10 more iterations: 11m 38s, 100 more iterations: 1h 56m 29s, 500 more iterations: 9h 42m 26s. [2025-11-13 10:06:39,578][__main__][INFO] - Starting iteration 730. [2025-11-13 10:06:40,062][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 72 and human policies 1. 
[2025-11-13 10:06:40,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:07:04,442][__main__][INFO] - Number of regex retries in iteration 730: 0 [2025-11-13 10:07:04,443][__main__][INFO] - agents played in iteration 730 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 10:07:05,250][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 49.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:05,450][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 49.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:05,476][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 49.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:05,533][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 49.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:07:05,534][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:07:05,534][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:07:07,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 10:07:07,687][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 10:07:08,196][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 10:07:08,707][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 10:07:09,221][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 10:07:09,727][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 10:07:10,234][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 10:07:10,741][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 10:07:11,247][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 10:07:11,748][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 10:07:12,252][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 10:07:12,754][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 10:07:13,262][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 10:07:13,769][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 10:07:14,277][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 10:07:14,784][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 10:07:15,284][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 10:07:15,795][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 10:07:16,292][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 10:07:16,810][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 10:07:17,308][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
10:07:17,810][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 10:07:18,308][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 10:07:18,812][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 10:07:19,323][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 10:07:19,831][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 10:07:20,335][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 10:07:20,849][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 10:07:21,357][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 10:07:21,866][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 10:07:22,367][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 10:07:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 10:07:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 10:07:23,875][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 10:07:24,379][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 10:07:24,886][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 10:07:25,391][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 10:07:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 10:07:26,406][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 10:07:26,908][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 10:07:27,411][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 10:07:27,914][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
10:07:28,425][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 10:07:28,930][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 10:07:29,436][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 10:07:29,943][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 10:07:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 10:07:30,949][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 10:07:31,452][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 10:07:31,955][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 10:07:32,466][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 10:07:32,970][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 10:07:33,472][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 10:07:33,971][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 10:07:34,475][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 10:07:34,983][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 10:07:35,489][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 10:07:35,990][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 10:07:36,501][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 10:07:36,998][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 10:07:37,505][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 10:07:38,002][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 10:07:38,506][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
10:07:39,022][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 10:07:39,519][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9998 tokens. [2025-11-13 10:07:40,398][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 13.96%, Current % of VRAM taken: 58.20%, Block Peak % of device VRAM: 62.27%, ΔTime: 00:00:33 [2025-11-13 10:07:41,037][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:07:41,039][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:07:41,041][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:07:43,099][__main__][INFO] - Iteration 731 took 1m 3s (38.67% Gen, 58.06% Train). Generation: 24s, Training: 36s. Estimated remaining time: 40h 33m 51s. Estimated total time: 52h 31m 53s. Time estimates for 10 more iterations: 10m 30s, 100 more iterations: 1h 45m 3s, 500 more iterations: 8h 45m 18s. [2025-11-13 10:07:43,101][__main__][INFO] - Starting iteration 731. [2025-11-13 10:07:43,584][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 73 and human policies 1. [2025-11-13 10:07:43,585][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2025-11-13 10:08:14,816][mllm.models.large_language_model_local][WARNING] - Response Proposal: 0 hats, 0 books, 0 balls Note: This proposal is based on the values given, where both you and Alice place significant value on books and balls, respectively. 
However, since we have no proposed values for hats from either of us, and considering the round dynamics, proposing to take all books and balls might maximize the points. But given the previous round results and the values assigned, it might be strategic to check if there's any value in taking some hats as well. However, for simplicity and based on the values provided, I'm sticking with the conservative approach. did not match regex: (?i)Proposal:\s*((?:\s*(?P(10|[0-9]))\s*(?Phats?|books?|balls?)\s*,?)+), retry 1/3 [2025-11-13 10:08:16,863][__main__][INFO] - Number of regex retries in iteration 731: 1 [2025-11-13 10:08:16,864][__main__][INFO] - agents played in iteration 731 are Alice, Bob_buffer, Alice_buffer, Bob [2025-11-13 10:08:17,699][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:17,727][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:17,755][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:17,778][mllm.training.trainer_ad_align][INFO] - For task: Get advantages with critic gradient accumulation, ΔVRAM % (total): 0.00%, Current % of VRAM taken: 51.30%, Block Peak % of device VRAM: 19.52%, ΔTime: 00:00:00 [2025-11-13 10:08:17,779][mllm.training.trainer_ad_align][INFO] - Sharing advantage alignment data. [2025-11-13 10:08:17,781][mllm.training.trainer_ad_align][INFO] - Receiving advantage packets. 
[2025-11-13 10:08:18,689][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 64 [2025-11-13 10:08:19,148][mllm.training.trainer_common][INFO] - Processing mini-batch 1 of 64 [2025-11-13 10:08:19,651][mllm.training.trainer_common][INFO] - Processing mini-batch 2 of 64 [2025-11-13 10:08:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 3 of 64 [2025-11-13 10:08:20,651][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 64 [2025-11-13 10:08:21,151][mllm.training.trainer_common][INFO] - Processing mini-batch 5 of 64 [2025-11-13 10:08:21,648][mllm.training.trainer_common][INFO] - Processing mini-batch 6 of 64 [2025-11-13 10:08:22,146][mllm.training.trainer_common][INFO] - Processing mini-batch 7 of 64 [2025-11-13 10:08:22,653][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 64 [2025-11-13 10:08:23,163][mllm.training.trainer_common][INFO] - Processing mini-batch 9 of 64 [2025-11-13 10:08:23,678][mllm.training.trainer_common][INFO] - Processing mini-batch 10 of 64 [2025-11-13 10:08:24,186][mllm.training.trainer_common][INFO] - Processing mini-batch 11 of 64 [2025-11-13 10:08:24,700][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 64 [2025-11-13 10:08:25,206][mllm.training.trainer_common][INFO] - Processing mini-batch 13 of 64 [2025-11-13 10:08:25,708][mllm.training.trainer_common][INFO] - Processing mini-batch 14 of 64 [2025-11-13 10:08:26,227][mllm.training.trainer_common][INFO] - Processing mini-batch 15 of 64 [2025-11-13 10:08:26,732][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 64 [2025-11-13 10:08:27,232][mllm.training.trainer_common][INFO] - Processing mini-batch 17 of 64 [2025-11-13 10:08:27,728][mllm.training.trainer_common][INFO] - Processing mini-batch 18 of 64 [2025-11-13 10:08:28,225][mllm.training.trainer_common][INFO] - Processing mini-batch 19 of 64 [2025-11-13 10:08:28,725][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 64 [2025-11-13 
10:08:29,232][mllm.training.trainer_common][INFO] - Processing mini-batch 21 of 64 [2025-11-13 10:08:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 22 of 64 [2025-11-13 10:08:30,244][mllm.training.trainer_common][INFO] - Processing mini-batch 23 of 64 [2025-11-13 10:08:30,744][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 64 [2025-11-13 10:08:31,251][mllm.training.trainer_common][INFO] - Processing mini-batch 25 of 64 [2025-11-13 10:08:31,757][mllm.training.trainer_common][INFO] - Processing mini-batch 26 of 64 [2025-11-13 10:08:32,262][mllm.training.trainer_common][INFO] - Processing mini-batch 27 of 64 [2025-11-13 10:08:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 64 [2025-11-13 10:08:33,272][mllm.training.trainer_common][INFO] - Processing mini-batch 29 of 64 [2025-11-13 10:08:33,798][mllm.training.trainer_common][INFO] - Processing mini-batch 30 of 64 [2025-11-13 10:08:34,304][mllm.training.trainer_common][INFO] - Processing mini-batch 31 of 64 [2025-11-13 10:08:34,811][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 64 [2025-11-13 10:08:35,321][mllm.training.trainer_common][INFO] - Processing mini-batch 33 of 64 [2025-11-13 10:08:35,824][mllm.training.trainer_common][INFO] - Processing mini-batch 34 of 64 [2025-11-13 10:08:36,329][mllm.training.trainer_common][INFO] - Processing mini-batch 35 of 64 [2025-11-13 10:08:36,827][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 64 [2025-11-13 10:08:37,332][mllm.training.trainer_common][INFO] - Processing mini-batch 37 of 64 [2025-11-13 10:08:37,826][mllm.training.trainer_common][INFO] - Processing mini-batch 38 of 64 [2025-11-13 10:08:38,322][mllm.training.trainer_common][INFO] - Processing mini-batch 39 of 64 [2025-11-13 10:08:38,827][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 64 [2025-11-13 10:08:39,338][mllm.training.trainer_common][INFO] - Processing mini-batch 41 of 64 [2025-11-13 
10:08:39,842][mllm.training.trainer_common][INFO] - Processing mini-batch 42 of 64 [2025-11-13 10:08:40,337][mllm.training.trainer_common][INFO] - Processing mini-batch 43 of 64 [2025-11-13 10:08:40,841][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 64 [2025-11-13 10:08:41,339][mllm.training.trainer_common][INFO] - Processing mini-batch 45 of 64 [2025-11-13 10:08:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 46 of 64 [2025-11-13 10:08:42,333][mllm.training.trainer_common][INFO] - Processing mini-batch 47 of 64 [2025-11-13 10:08:42,832][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 64 [2025-11-13 10:08:43,344][mllm.training.trainer_common][INFO] - Processing mini-batch 49 of 64 [2025-11-13 10:08:43,850][mllm.training.trainer_common][INFO] - Processing mini-batch 50 of 64 [2025-11-13 10:08:44,352][mllm.training.trainer_common][INFO] - Processing mini-batch 51 of 64 [2025-11-13 10:08:44,858][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 64 [2025-11-13 10:08:45,361][mllm.training.trainer_common][INFO] - Processing mini-batch 53 of 64 [2025-11-13 10:08:45,865][mllm.training.trainer_common][INFO] - Processing mini-batch 54 of 64 [2025-11-13 10:08:46,367][mllm.training.trainer_common][INFO] - Processing mini-batch 55 of 64 [2025-11-13 10:08:46,869][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 64 [2025-11-13 10:08:47,374][mllm.training.trainer_common][INFO] - Processing mini-batch 57 of 64 [2025-11-13 10:08:47,874][mllm.training.trainer_common][INFO] - Processing mini-batch 58 of 64 [2025-11-13 10:08:48,373][mllm.training.trainer_common][INFO] - Processing mini-batch 59 of 64 [2025-11-13 10:08:48,880][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 64 [2025-11-13 10:08:49,376][mllm.training.trainer_common][INFO] - Processing mini-batch 61 of 64 [2025-11-13 10:08:49,874][mllm.training.trainer_common][INFO] - Processing mini-batch 62 of 64 [2025-11-13 
10:08:50,378][mllm.training.trainer_common][INFO] - Processing mini-batch 63 of 64 [2025-11-13 10:08:50,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 9845 tokens. [2025-11-13 10:08:51,730][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 14.05%, Current % of VRAM taken: 58.30%, Block Peak % of device VRAM: 62.24%, ΔTime: 00:00:33 [2025-11-13 10:08:52,490][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/policy_optimizer_state.pt [2025-11-13 10:08:52,492][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/critic_optimizer_state.pt [2025-11-13 10:08:52,494][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/m/muqeeth/llm_negotiation/2025_11/no_press_10_1_ties_ad_align_nocurrtimestep_seed9999/seed_9999/agent_trainer/trainer_annealing_state.pkl [2025-11-13 10:08:53,441][__main__][INFO] - Iteration 732 took 1m 9s (47.64% Gen, 51.00% Train). Generation: 33s, Training: 35s. Estimated remaining time: 46h 13m 41s. Estimated total time: 58h 12m 53s. Time estimates for 10 more iterations: 11m 38s, 100 more iterations: 1h 56m 25s, 500 more iterations: 9h 42m 8s. [2025-11-13 10:08:53,443][__main__][INFO] - Starting iteration 732. [2025-11-13 10:08:53,944][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 73 and human policies 1. [2025-11-13 10:08:53,944][__main__][INFO] - Hard coded buffer agents are set to False with prob 0